Abstract


We collected file system content data from 857 desktop 
computers at Microsoft over a span of 4 weeks. We 
analyzed the data to determine the relative efficacy of 
data deduplication, particularly considering whole-file 
versus block-level elimination of redundancy. We 
found that whole-file deduplication achieves about 
three quarters of the space savings of the most aggres- 
sive block-level deduplication for storage of live file 
systems, and 87% of the savings for backup images. 
We also studied file fragmentation, finding that it is not
prevalent, and updated prior file system metadata stud- 
ies, finding that the distribution of file sizes continues 
to skew toward very large unstructured files. 


1 Introduction 


File systems often contain redundant copies of infor- 
mation: identical files or sub-file regions, possibly 
stored on a single host, on a shared storage cluster, or 
backed-up to secondary storage. Deduplicating storage 
systems take advantage of this redundancy to reduce the 
underlying space needed to contain the file systems (or 
backup images thereof). Deduplication can work at 
either the sub-file [10, 31] or whole-file [5] level. More 
fine-grained deduplication creates more opportunities 
for space savings, but necessarily reduces the sequential 
layout of some files, which may have significant per- 
formance impacts when hard disks are used for storage 
(and in some cases [33] necessitates complicated tech- 
niques to improve performance). Alternatively, whole- 
file deduplication is simpler and eliminates file- 
fragmentation concerns, though at the cost of some oth- 
erwise reclaimable storage. 


Because the disk technology trend is toward improved 
sequential bandwidth and reduced per-byte cost with 
little or no improvement in random access speed, it’s 
not clear that trading away sequentiality for space sav- 
ings makes sense, at least in primary storage. 


In order to evaluate the tradeoff in space savings between
whole-file and block-based deduplication, we
conducted a large-scale study of file system contents on
desktop Windows machines at Microsoft. Our study 
consists of 857 file systems spanning 162 terabytes of 
disk over 4 weeks. It includes results from a broad 
cross-section of employees, including software devel- 
opers, testers, management, sales & marketing, tech- 
nical support, documentation writers and legal staff. 
We find that while block-based deduplication of our 
dataset can lower storage consumption to as little as 
32% of its original requirements, nearly three quarters 
of the improvement observed could be captured through 
whole-file deduplication and sparseness. For four 
weeks of full backups, whole file deduplication (where 
a new backup image contains a reference to a duplicate 
file in an old backup) achieves 87% of the savings of 
block-based. We also explore the parameter space for 
deduplication systems, and quantify the relative bene- 
fits of sparse file support. Our study of file content is 
larger and more detailed than any previously published 
effort, which promises to inform the design of space- 
efficient storage systems. 


In addition, we have conducted a study of metadata and 
data layout, as the last similar study [1] is now 4 years 
old. We find that the previously observed trend toward 
storage being consumed by files of increasing size con- 
tinues unabated; half of all bytes are in files larger than 
30MB (this figure was 2MB in 2000). Complicating 
matters, these files are in opaque unstructured formats 
with complicated access patterns. At the same time 
there are increasingly many small files in an increasing- 
ly complex file system tree. 


Contrary to previous work [28], we find that file-level 
fragmentation is not widespread, presumably due to 
regularly scheduled background defragmenting in Win- 
dows [17] and the finding that a large portion of files 
are rarely modified (see Section 4.4.2). For more than a 
decade, file system designers have been warned against 
measuring only fresh file system installations, since 
aged systems can have a significantly different perfor- 
mance profile [28]. Our results show that this concern 
may no longer be relevant, at least to the extent that the 
aging produces file-level fragmentation. Ninety-six
percent of files observed are entirely linear in the block
address space. To our knowledge, this is the first large 
scale study of disk fragmentation in the wild. 


We describe in detail the novel analysis optimizations 
necessitated by the size of this data set. 


2 Methodology 


Potential participants were selected randomly from Mi- 
crosoft employees. Each was contacted with an offer to 
install a file system scanner on their work computer(s) 
in exchange for a chance to win a prize. The scanner 
ran autonomously during off hours once per week from 
September 18 to October 16, 2009. We contacted 10,500
people in this manner to reach the target study size of 
about 1000 users. This represents a participation rate of 
roughly 10%, which is smaller than the rates of 22% in 
similar prior studies [1, 9]. Anecdotally, many potential 
participants declined explicitly because the scanning 
process was quite invasive. 


2.1 File system Scanner 

The scanner first took a consistent snapshot of fixed 
device (non-removable) file systems with the Volume 
Shadow Copy Service (VSS) [20]. VSS snapshots are 
both file system and application consistent¹. It then
recorded metadata about the file system itself, including 
age, capacity, and space utilization. The scanner next 
processed each file in the snapshot, writing records to a 
log. It recorded Windows file metadata [19], including 
path, file name and extension, time stamps, and the file 
attribute flags. It recorded any retrieval and allocation 
pointers, which describe fragmentation and sparseness 
respectively. It also recorded information about the 
whole system, including the computer’s hardware and 
software configuration and the time at which the 
defragmentation tool was last run, which is available in 
the Windows registry. We took care to exclude from 
study the pagefile, hibernation file, the scanner itself, 
and the VSS snapshots it created. 
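
As a rough illustration of how retrieval pointers can reveal fragmentation, the sketch below treats each file as a list of extents and calls it linear when every extent begins where the previous one ended. The `Extent` type and its field names are assumptions made for this example; they are not the scanner's actual record format.

```python
from typing import List, NamedTuple


class Extent(NamedTuple):
    """Hypothetical extent record: one run of contiguous disk clusters."""
    start_cluster: int   # first cluster of the run on disk
    length: int          # number of clusters in the run


def is_linear(extents: List[Extent]) -> bool:
    """Return True if the extents form a single contiguous run on disk."""
    for prev, cur in zip(extents, extents[1:]):
        if cur.start_cluster != prev.start_cluster + prev.length:
            return False
    return True


def fraction_linear(files: List[List[Extent]]) -> float:
    """Fraction of files whose extents are entirely contiguous."""
    if not files:
        return 0.0
    return sum(is_linear(f) for f in files) / len(files)
```

A file with a single extent, or with no extents at all, counts as linear under this definition.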


¹ “Application consistent” means that VSS-aware applications
have an opportunity to save their state cleanly
before the snapshot is taken.


During the scan, we recorded the contents of each file
by first breaking the file into chunks using each of two
chunking algorithms (fixed block and Rabin fingerprinting
[25]) with each of 4 chunk size settings (8K-64K in powers
of two) and then computing and saving hashes of each chunk.
We found whole file duplicates in post-processing by
identifying files in which all chunks matched. In addition
to reading the ordinary
contents of files we also collected a separate set of 
scans where the files were read using the Win32 Back- 
upRead API [16], which includes metadata about the 
file and would likely be the format used to store file 
system backups. 
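
The fixed-block path described above can be sketched as follows. This is a minimal illustration, not the scanner's code: the salt value, the way the salt is combined with the chunk (prepended here), and all function names are assumptions.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple

SALT = b"example-salt"   # assumption: the salt used in the study is not given
HASH_BYTES = 6           # 48 bits, matching the truncation described in the text


def chunk_hash(chunk: bytes) -> bytes:
    """Salted MD5 of a chunk, truncated to 48 bits."""
    return hashlib.md5(SALT + chunk).digest()[:HASH_BYTES]


def fixed_chunks(data: bytes, size: int) -> List[bytes]:
    """Split data into fixed-size blocks; the last block may be short."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def file_signature(data: bytes, size: int = 8192) -> Tuple[bytes, ...]:
    """Ordered chunk hashes that stand in for the file's contents."""
    return tuple(chunk_hash(c) for c in fixed_chunks(data, size))


def whole_file_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose chunk hashes all match, in order."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[file_signature(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Because only chunk hashes are compared, two files are reported as whole-file duplicates exactly when every chunk hash matches in sequence, which mirrors the post-processing rule in the text.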


We used salted MD5 [26] as our hash algorithm, but
truncated the result to 48 bits in order to reduce the size 
of the data set. The Rabin-chunked data with an 8K 
target chunk size had the largest number of unique 
hashes, somewhat more than 768M. We expect that 
about two thousand of those (0.0003%) are false 
matches due to the truncated hash. 
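
One way to arrive at an estimate of this size is the birthday approximation, sketched below. This is our reading of the arithmetic rather than a calculation given in the text, and it assumes that "768M" means 768 million hashes and that truncated hash values are uniformly distributed.

```python
def expected_colliding_pairs(n_hashes: int, bits: int) -> float:
    """Birthday approximation: n(n-1)/2 pairs, each matching with probability 2**-bits."""
    return n_hashes * (n_hashes - 1) / 2 / 2 ** bits


n = 768_000_000                      # assumption: "768M" read as 768 million
pairs = expected_colliding_pairs(n, 48)
false_matches = 2 * pairs            # each colliding pair involves two hashes
fraction = false_matches / n         # share of unique hashes affected

print(f"{pairs:.0f} pairs, {false_matches:.0f} hashes, {fraction:.5%} of the set")
```

Under these assumptions the estimate comes out to roughly a thousand colliding pairs, or about two thousand affected hashes, which is consistent with the figure quoted above.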


Another process copied the log files to our server at 
midnight on a random night of the week to help smooth 
the considerable network traffic. Nevertheless, the cop- 
ying process resulted in the loss of some of the scans. 
Because the scanner placed the results for each of the 
32 parameter settings into separate files and the copying 
process worked at the file level, for some file systems 
we have results for some, but not all of the parameter 
settings. In particular, larger scan files tended to be par- 
tially copied more frequently than smaller ones, which 
may result in a bias in our data where larger file sys- 
tems are more likely to be excluded. Similarly, scans 
with a smaller chunk size parameter resulted in larger 
size scan files and so were lost at a higher rate. 


2.2 Post Processing 

At the completion of the study the resulting data set was 
4.12 terabytes compressed, which would have required 
considerable machine time to import into a database. As 
an optimization, we observed that the actual value of 
any unique hash (i.e., hashes of content that was not 
duplicated) was not useful to our analyses. 


To find these unique hashes quickly we used a novel 2- 
pass algorithm. During the first pass we created a 2 GB 
Bloom filter [4] of each hash observed. During this 
pass, if we tried to insert a value that was already in the 
Bloom filter, we inserted it into a second Bloom filter 
of equal size. We then made a second pass through the 
logs, comparing each hash to the second Bloom filter 
only. If it was not found in the second filter, we were 
certain that the hash had been seen exactly once and 
could be omitted from the database. If it was in the fil- 
ter, we concluded that either the hash value had been 
seen more than once, or that its entry in the filter was a 
collision. We recorded all of these values to the data- 
base. Thus this algorithm was sound, in that it did not 
impact the results by rejecting any duplicate hashes.
However, it was not complete despite being very effective,
in that some non-duplicate hashes may have been
added to the database even though they were not useful 
in the analysis. The inclusion of these hashes did not 
affect our results, as the later processing ignored them. 
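
The two-pass filtering described above can be sketched as follows. The Bloom filter here is a small illustrative one (the sizes, the number of hash functions, and the use of SHA-256 to derive bit positions are all assumptions), whereas the study used 2 GB filters over on-disk logs.

```python
import hashlib
from typing import Iterable, List


class BloomFilter:
    """Minimal Bloom filter; far smaller than the 2 GB filters in the text."""

    def __init__(self, n_bits: int = 1 << 20, n_hashes: int = 4) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item: bytes) -> List[int]:
        # Double hashing: derive n_hashes bit positions from one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def possibly_duplicated(log: Iterable[bytes]) -> List[bytes]:
    """Return every log entry whose hash may occur more than once."""
    entries = list(log)          # the real logs would be re-read from disk
    seen, repeated = BloomFilter(), BloomFilter()
    for h in entries:            # pass 1: fill both filters
        if h in seen:
            repeated.add(h)      # already present (or a false positive)
        else:
            seen.add(h)
    # Pass 2: keep only entries that hit the second filter.
    return [h for h in entries if h in repeated]
```

Since a Bloom filter has no false negatives, every hash that truly repeats survives the second pass; false positives only add a few unique hashes to the output, matching the sound-but-not-complete behavior described in the text.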


2.3 Biases and Sources of Error 

The use of Windows workstations in this study is bene- 
ficial in that the results can be compared to those of 
similar studies [1, 9]. However, as in all data sets, this 
choice may introduce biases towards certain types of 
activities or data. For example, corporate policies sur- 
rounding the use of external software and libraries 
could have impacted our results. 


As discussed above, the data retrieved from machines 
under observation was large and expensive to generate 
and so resulted in network timeouts at our server or 
aborted scans on the client side. While we took 
measures to limit these effects, nevertheless some 
amount of data never made it to the server, and more 
had to be discarded as incomplete records. Our use of 
VSS makes it possible for a user to selectively remove 
some portions of their file system from our study. 


We discovered a rare concurrency bug in the scanning 
tool affecting 0.003% of files. While this likely did not 
affect results, we removed all files with this artifact. 


Our scanner was unable to read the contents of Win- 
dows system restore points, though it could see the file 
metadata. We excluded these files from the deduplica- 
tion analyses, but included them in the metadata anal- 
yses. 


3 Redundancy in File Contents 


Despite the significant declines in storage costs per GB, 
many organizations have seen dramatic increases in 
total storage system costs [21]. There is considerable 
interest in reducing these costs, which has given rise to 
deduplication techniques, both in the academic com- 
munity [6] and as commercial offerings [7, 10, 14, 33]. 
Initially, the interest in deduplication has centered on its 
use in “embarrassingly compressible” scenarios, such 
as regular full backups [3, 8] or virtual desktops [6, 13]. 
However, some have also suggested that deduplication 
be used more widely on general purpose data sets [31]. 


The rest of this section seeks to provide a well-founded 
measure of duplication rates and compare the efficacy 
of different parameters and methods of deduplication. 
In Section 3.1 we provide a brief summary of deduplication,
and in Section 3.2 we discuss the performance
challenges deduplication introduces. In Section 3.3 we 
share observed duplication rates across a set of work- 
stations. Finally, Section 3.4 measures duplication in 
the more conventional backup scenario. 


3.1 Background on Deduplication 
Deduplication systems decrease storage consumption 
by identifying distinct chunks of data with identical 
content. They then store a single copy of the chunk 
along with metadata about how to reconstruct the origi- 
nal files from the chunks. 


Chunks may be of a predefined size and alignment, but 
are more commonly of variable size determined by the 
content itself. The canonical algorithm for variable- 
sized content-defined blocks is Rabin Fingerprints [25]. 
By deciding chunk boundaries based on content, files 
that contain identical content that is shifted (say be- 
cause of insertions or deletions) will still result in 
(some) identical chunks. Rabin-based algorithms are 
typically configured with a minimum and maximum 
chunk size, as well as an expected chunk size. In all 
our experiments, we set the minimum and maximum 
parameters to 4K and 128K, respectively while we var- 
ied the expected chunk size from 8K to 64K by powers- 
of-two. 
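
To make the role of the minimum, maximum, and expected chunk sizes concrete, here is a small content-defined chunking sketch. It substitutes a simple polynomial rolling hash for true Rabin fingerprints, and the window width and multiplier are assumptions chosen for illustration; only the 4K/128K bounds and the 8K expected size come from the text.

```python
from typing import List

WINDOW = 48            # assumption: sliding-window width in bytes
BASE = 1_000_003       # assumption: multiplier for the rolling hash
MOD = 1 << 32


def content_defined_chunks(data: bytes, min_size: int = 4096,
                           expected_size: int = 8192,
                           max_size: int = 131072) -> List[bytes]:
    """Cut where a rolling hash of the last WINDOW bytes matches a bit mask."""
    mask = expected_size - 1             # expected_size must be a power of two
    drop = pow(BASE, WINDOW - 1, MOD)    # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:          # window full: remove the oldest byte
            h = (h - data[i - WINDOW] * drop) % MOD
        h = (h * BASE + byte) % MOD
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0          # restart the window for the next chunk
    if start < len(data):
        chunks.append(data[start:])      # trailing partial chunk
    return chunks
```

Because a boundary depends only on the bytes inside the window, an insertion early in a file shifts later boundaries along with the content, so chunks after the edit can still hash to the same values. With a minimum size enforced, the average chunk is somewhat larger than the nominal expected size.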


3.2 The Performance Impacts of Deduplication

Managing the overheads introduced by a deduplication 
system is challenging. Naively, each chunk’s finger- 
print needs to be compared to that of all other chunks. 
While techniques such as caches and Bloom filters can 
mitigate overheads, the performance of deduplication 
systems remains a topic of research interest [32]. The 
I/O system also poses a performance challenge. In addi- 
tion to the layer of indirection required by deduplica- 
tion, deduplication has the effect of de-linearizing data 
placement, which is at odds with many data placement 
optimizations, particularly on hard-disk based storage 
where the cost for non-sequential access can be orders 
of magnitude greater than sequential. 


Other more established techniques to reduce storage 
consumption are simpler and have smaller performance 
impact. Sparse file support exists in many file systems 
including NTFS [23], XFS [29], and ext4 [15] and is 
relatively simple to implement. In a sparse file a chunk 
of zeros is stored notationally by marking its existence 
in the metadata, removing the need to physically store 
it. Whole file deduplication systems, such as the Windows
SIS facility [5] operate by finding entire files that
are duplicates and replacing them by copy-on-write
links. Although SIS does not reduce storage consumption
as much as a modern deduplication system, it
avoids file allocation concerns and is far less computationally
expensive than more exhaustive deduplication.


Figure 1: Deduplication vs. Chunk Size for Various
Algorithms

Figure 3: CDF of Bytes by Containing File Size for
Whole File Duplicates and All Files

Extension | % of Duplicate Space | Mean File Size
dll       | 20%                  | 521K
lib       | 11%                  | 1080K
pdb       | 11%                  | 2M
<none>    | 7%                   | 277K
exe       | 6%                   | 572K
cab       | 4%                   | 4M
msp       | 3%                   | 15M
msi       | 3%                   | 5M
iso       | 2%                   | 436M
<a guid>  | 1%                   | 604K
hxs       | 1%                   | 2M
xml       | 1%                   | 49K
jpg       | 1%                   | 147K
wim       | 1%                   | 16M
h         | 1%                   | 23K

Table 1: Whole File Duplicates by Extension

Figure 4: CDF of File System Capacity

Extension | Fixed % | Extension | Rabin %
vhd       | 3.6%    | vhd       | 5.2%
pch       | 0.5%    | lib       | 1.6%
dll       | 0.5%    | obj       | 0.8%
pdb       | 0.4%    | pdb       | 0.6%
lib       | 0.4%    | pch       | 0.6%
wma       | 0.3%    | iso       | 0.6%
pst       | 0.3%    | dll       | 0.6%
<none>    | 0.3%    | avhd      | 0.5%
avhd      | 0.3%    | wma       | 0.4%
mp3       | 0.3%    | wim       | 0.4%
pds       | 0.2%    | zip       | 0.3%
iso       | 0.2%    | pst       | 0.3%

Table 2: Non-whole File, Non-Zero Duplicate
Data as a Fraction of File System Size by File
Extension, 8K Fixed and Rabin Chunking


FAST ’11: 9th USENIX Conference on File and Storage Technologies

USENIX Association


3.3 Deduplication in Primary Storage 
Our data set includes hashes of data in both variable 
and fixed size chunks, and of varying sizes. We chose a 
single week (September 18, 2009) from this dataset and 
compared the size of all unique chunks to the total con- 
sumption observed. We had two parameters that we 
could vary: the deduplication algorithm/parameters and 
the set of file systems (called the deduplication domain) 
within which we found duplicates; duplicates in sepa- 
rate domains were considered to be unique contents. 


The set of file systems included corresponds to the size 
of the file server(s) holding the machines’ file systems. 
A value of 1 indicates deduplication running inde- 
pendently on each desktop machine. “Whole Set” 
means that all 857 file systems are stored together in a 
single deduplication domain. We considered all
power-of-two domain sizes between 1 and 857. For domain
sizes other than 1 or 857, we had to choose which file 
systems to include together into particular domains and 
which to exclude when the number of file systems 
didn’t divide evenly by the size of the domain. We did 
this by using a cryptographically secure random num- 
ber generator. We generated sets for each domain size 
ten times and report the mean of the ten runs. The 
standard deviation of the results was less than 2% for 
each of the data points, so we don’t believe that we 
would have gained much more precision by running 
more trials².
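
The domain experiment described above can be sketched as follows. This is an illustrative simplification: each file system is modeled as a list of hashes of equal-sized chunks, the function names are hypothetical, and `random.SystemRandom` stands in for whatever cryptographically secure generator was actually used.

```python
import random
from statistics import mean
from typing import List, Sequence

FileSystem = Sequence[bytes]   # assumption: one hash per equal-sized chunk


def random_domains(n_fs: int, size: int, rng: random.Random) -> List[List[int]]:
    """Shuffle file system indices and cut them into full domains of `size`."""
    order = list(range(n_fs))
    rng.shuffle(order)
    usable = n_fs - n_fs % size          # leftover file systems are excluded
    return [order[i:i + size] for i in range(0, usable, size)]


def deduplicated_fraction(systems: List[FileSystem], size: int,
                          rng: random.Random) -> float:
    """Deduplicated size over original size for one random grouping."""
    total = unique = 0
    for domain in random_domains(len(systems), size, rng):
        hashes = [h for idx in domain for h in systems[idx]]
        total += len(hashes)
        unique += len(set(hashes))       # matches across domains do not count
    return unique / total if total else 1.0


def mean_over_trials(systems: List[FileSystem], size: int,
                     trials: int = 10) -> float:
    """Average the deduplicated fraction over several random groupings."""
    rng = random.SystemRandom()          # OS-backed secure random source
    return mean(deduplicated_fraction(systems, size, rng) for _ in range(trials))
```

With 857 file systems and a domain size of 64, for example, this grouping yields 13 full domains and leaves 25 file systems out of that trial.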


Rather than presenting a three dimensional graph vary- 
ing both parameters, we show two slices through the 
surface. In both cases, the y-axis shows the deduplicat- 
ed file system size as a percentage of the original file 
system size. Figure 1 shows the effect of the chunk size 
parameter for the fixed and Rabin-chunked algorithms, 
and also for the whole file algorithm (which doesn’t 
depend on chunk size, and so varies only slightly due to 
differences in the number of zeroes found and due to 
variations in which file systems scans copied properly; 
see Section 2.1). This graph assumes that all file systems
are in a single deduplication domain; the shape of
the curve is similar for smaller domains, though the
space savings are reduced. 





² As it was, it took about 8 machine-months to do the
analyses. 


Figure 2 shows the effect of changing the size of the
deduplication domains. Space reclaimed improves 
roughly linearly in the log of the number of file systems 
in a domain. Comparing single file systems to the 
whole set, the effect of grouping file systems together is 
larger than that from the choice of chunking algorithm 
or chunk size, or even of switching from whole file 
chunking to block-based. 


The most aggressive chunking algorithm (8K Rabin) 
reclaimed between 18% and 20% more of the total file 
size than did whole file deduplication. This offers weak 
support for block-level deduplication in primary stor- 
age. The 8K fixed block algorithm reclaimed between 
10% and 11% more space than whole file. This capacity
savings represents a small gain compared to the
performance cost and complexity of introducing advanced
deduplication features, especially ones with dynamically
variable block sizes like Rabin fingerprinting.


Table 1 shows the top 15 file extensions contributing to 
duplicate content for whole file duplicates, the percent- 
age of duplicate space attributed to files of that type, 
and the mean file size for each type. It was calculated 
using all of the file systems in a single deduplication 
domain. The extension marked <a guid> is a particular 
globally unique ID that’s associated with a widely dis- 
tributed software patch. This table shows that the sav- 
ings due to whole file duplicates are concentrated in 
files containing program binaries: dll, lib, pdb, exe, cab, 
msp, and msi together make up 58% of the saved space. 


Figure 3 shows the CDF of the bytes reclaimed by 
whole file deduplication and the CDF of all bytes, both 
by containing file size. It shows that duplicate bytes 
tend to be in smaller files than bytes in general. Anoth- 
er way of looking at this is that the very large file types 
(virtual hard disks, database stores, etc.) tend not to 
have whole-file copies. This is confirmed by Table 1. 


Table 2 shows the amount of duplicate content not in 
files with whole-file duplicates by file extension as a 
fraction of the total file system content. It considers the 
whole set of file systems as a single deduplication do- 
main, and presents results with an 8K block size using 
both fixed and Rabin chunking. For both algorithms, 
by far the largest source of duplicate data is VHD (vir- 
tual hard drive) files. Because these files are essentially 
disk images, it’s not surprising both that they contain 
duplicate data and also that they rarely have whole-file 
duplicates. The next four file types are all compiler 
outputs. We speculate that they generate block-aligned 
duplication because they have header fields that contain,
for example, timestamps but that their contents are
otherwise deterministic in the code being compiled.
Rabin chunking may find blocks of code (or symbols) 
that move somewhat in the file due to code changes that 
affect the length of previous parts of the file. 


3.4 Deduplication in Backup Storage 
Much of the literature on deduplication to date has re- 
lied on workloads consisting of daily full backups [32, 
33]. Certainly these workloads represent the most at- 
tractive scenario for deduplication, because the content 
of file systems does not change rapidly. Our data set 
did not allow us to consider daily backups, so we con- 
sidered only weekly ones. 


With frequent and persistent backups, the size of historical
data will quickly outpace that of the running system.
Furthermore, performance in secondary storage is
less critical than in primary storage, so the reduced
sequentiality of a block-level deduplicated store is of
lesser concern. We considered the 483 file systems for
which four continuous weeks of complete scans were 
available, starting with September 18, 2009, the week 
used for the rest of the analyses. 


Our backup analysis considers each file system as a 
separate deduplication domain. We expect that com- 
bining multiple backups into larger domains would 
have a similar effect to doing the same thing for prima- 
ry storage, but we did not run the analysis due to re- 
source constraints. 


In practice, some backup solutions are incremental (or 
differential), storing deltas between files, while others 
use full backups. Often, highly reliable backup policies 
use a mix of both, performing frequent incremental 
backups, with occasional full backups to limit the po- 
tential for loss due to corruption. Thus, the meaning of 
whole-file deduplication in a backup store is not imme- 
diately obvious. We ran the analysis as if the backups 
were stored as simple copies of the original file systems,
except that the contents of the files were the output
from the Win32 BackupRead [16] call, which includes 
some file metadata along with the data. For our pur- 
poses, imagine that the backup format finds whole file 
duplicates and stores pointers to them in the backup 
file. This would result in a garbage collection problem 
for the backup files when they’re deleted, but the details 
of that are beyond the scope of our study and are likely 
to be simpler than a block-level deduplicating store. 


Using the Rabin chunking algorithm with an 8K ex- 
pected chunk size, block-level deduplication reclaimed 
83% of the total space. Whole file deduplication, on
the other hand, yielded 72%. These numbers, of
course, are highly sensitive to the number of weeks of
scans used in the study; it’s no accident that the results
were around three quarters of the space being reclaimed when there
were four weeks of backups. However, one should not
assume that because 72% of the space was reclaimed by 
whole file deduplication that only 3% of the bytes were 
in files that changed. The amount of change was larger 
than that, but the deduplicator found redundancy within 
a week as well and the two effects offset. 
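
The arithmetic behind these figures can be laid out in a few lines. The baseline assumes, purely for illustration, that every weekly full backup is byte-identical to the first, so that all but one copy can be reclaimed; the 72% and 83% values are the ones reported above.

```python
def identical_backup_savings(weeks: int) -> float:
    """Fraction reclaimed if every weekly full backup were byte-identical."""
    return (weeks - 1) / weeks


baseline = identical_backup_savings(4)     # three of four copies are redundant
whole_file, block_level = 0.72, 0.83       # savings reported in the text
relative = whole_file / block_level        # whole-file share of block-level savings
shortfall = baseline - whole_file          # gap to the unchanged-data baseline

print(f"baseline {baseline:.0%}, relative {relative:.0%}, shortfall {shortfall:.0%}")
```

The 3% shortfall is a net figure: as the text notes, changed files push whole-file savings below the baseline, while duplicates found within a single week push them back up.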


4 Metadata 


This paper is the third major metadata study of Windows
desktop computers [1, 9]. This provides a unique perspective
in the published literature, as we are able to
track more than a decade of trends in file and file system
metadata. On a number of graphs, we took the lines 
from 2000 and 2004 from an earlier study [1] and plot- 
ted them on our graphs to make comparisons easier. 
Only the 2009 data is novel to this paper. Some graphs 
contain both CDF and histogram lines. In these graphs, 
the CDF should be read from the left-hand y-scale and 
the histogram from the right. We present much of our 
data in the form of cumulative distribution function plots.
These plots make it easy to determine the distributions, 
but do not easily show the mean. Where appropriate, 
we list the mean of the distribution in the text.


4.1 Physical Machines

Our data set contains scans of 857 file systems hosted 
on 597 computers. 59% were running Windows 7, 20% 
Windows Vista, 18% Windows Server 2008 and 3% 
Windows Server 2003. They had a mean and median 
physical RAM of about 4GB, and ranged from 1-10GB. 
5% had 8 processors, 44% had 4, 49% had 2, and 3% were
uniprocessors³.


4.2 File systems 

We analyze file systems in terms of their age, capacity, 
fullness, and the number of files and directories. We 
present our results, interpretations, and recommenda- 
tions to designers in this section. 


4.2.1 Capacity 

The mean file system capacity is 194GB. Figure 4 
shows a cumulative distribution function of the capacities of
the file systems in the study. It shows a significant in- 
crease in the range of commonly observed file system 
sizes and the emergence of a noticeable step function in 
the capacities. Both of these trends follow from the 
approximately annual doubling of physical drive capac- 
ity. We expect that this file system capacity range will 
continue to increase, anchored by smaller SSDs on the 
left, and continuing step wise towards larger magnetic 
devices on the right. This will either force file systems 
to perform acceptably on an increasingly wide range of 
media, or push users towards more highly tuned special 
purpose file systems. 


4.2.2 Utilization 

Although capacity has increased by nearly two orders 
of magnitude since 2000, utilization of capacity has 
dropped only slightly, as shown in Figure 5. Mean uti- 
lization is 43%, only somewhat less than the 53% found 
in 2000. No doubt this is the result of both users adapt- 
ing to their available space and hard drive manufactur- 
ers tracking the growth in data. The CDF shows a near- 
ly linear relationship, with 50% of users having drives 
no more than 40% full, 70% at less than 60% utiliza- 
tion, and 90% at less than 80%. Proposals to take ad- 
vantage of the unused capacity of file systems [2, 11] 
must be cautious that they only assume scaling of the 
magnitude of free space, not the relative portion of the 
disk that is free. System designers also must take care 
not to ignore the significant contingent (15%) of all 
users with disks more than 75% full. 





³ The total is 101% due to rounding error.


4.3 File system Namespace

Recently, Murphy and Seltzer have questioned the mer- 
its of hierarchical file systems [22], based partly on the 
challenge of managing increasing data sizes. Our analy- 
sis shows many ways in which namespaces have be- 
come more complex. We have observed more files, 
more directories, and an increase in namespace depth. 
While a rigorous comparison of namespace organiza- 
tion structures is beyond the scope of this paper, the 
increase in namespace complexity does lend evidence 
to the argument that change is needed in file system 
organization. Both file and directory counts show a 
significant increase from previous years in Figures 6 
and 7 respectively, with a mean of 225K files and 36K 
directories per file system. 


The CDF in Figure 8 shows the number of files per 
directory. While the change is small, it is clear: even
as users in 2009 have more files, they have fewer files
per directory, with a mean of 6.25 files per directory.


Figure 9 shows the distribution of subdirectories per 
directory. Since the mean subdirectories per directory 
is necessarily one⁴, the fact that the distribution is more
skewed toward smaller sizes indicates that the directory 
structure is deeper with a smaller branching factor. 
However, the exact interpretation of this result warrants 
further study. It is not clear if this depth represents a 
conscious organization choice, is the result of users 
being unable effectively to organize their hierarchical 
data or is simply due to the design of the software that 
populates the tree. Figure 10 shows the histogram and 
CDF of files by directory depth for the 2009 data; simi- 
lar results were not published in the earlier studies. 
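
The claim that the mean number of subdirectories per directory is pinned near one follows from counting parent links: every directory except the root is a subdirectory of exactly one parent. The toy tree below is hypothetical and only illustrates that counting argument, along with how a fixed mean can still coexist with a deep, narrow structure.

```python
from typing import Dict, List

# Hypothetical tree: each directory maps to the list of its subdirectories.
tree: Dict[str, List[str]] = {
    "/": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["d"],
    "d": [],
}


def mean_subdirectories(tree: Dict[str, List[str]]) -> float:
    """Total child links divided by the number of directories."""
    return sum(len(children) for children in tree.values()) / len(tree)


def max_depth(tree: Dict[str, List[str]], root: str = "/") -> int:
    """Depth of the deepest directory, counting the root as depth 0."""
    return max((1 + max_depth(tree, c) for c in tree[root]), default=0)
```

For N directories the mean is (N - 1) / N, which approaches one as N grows; here it is 4/5 with a maximum depth of 3. The same style of division gives the files-per-directory figure in the text: 225K files over 36K directories is 6.25.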


The histogram in Figure 11 shows how the utilization 
of storage is related to namespace depth. There is a 
steep decline in the number of bytes stored more than 5 
levels deep in the tree. However, as we will see in Sec- 
tion 4.4, this does not mean the deeply nested files are 
unimportant. Comparing it with Figure 10 shows that 
files higher in the directory tree are larger than those 
deeper. 


4.4 Files 


Our analysis of files in the dataset shows distinct clas- 
ses of files emerging. The frequently observed fact that 
most files are small and most bytes are in large files has 
intensified. The mean file size is now 318K, about three 
times what it was in 2000. Files can be classified by
their update time as well. A large class of files is written
only once (perhaps at install time).


⁴ Ignoring that the root directory isn’t a member of any
directory.


4.4.1 File Size 

In one respect, file sizes have not changed at all. The 
median file size remains 4K (a result that has been re- 
markably consistent since at least 1981 [27]), and the 
distribution of file sizes has changed very little since 
2000. Figure 12 shows that the proportion of these 
small files has in fact increased with fewer files both 
somewhat larger and somewhat smaller than 4K. There 
is also an increase in larger files between 512K and 
8MB.


Figure 13 shows a histogram of the total number of 
bytes stored in files of various sizes. A trend towards 
bi-modality has continued, as predicted in 2007 [1], 
though a third mode above 16G is now appearing. Fig- 
ure 14 shows that more capacity usage has shifted to the 
larger files, even though there are still few such files in 
the system. This suggests that optimizing for large files 
will be increasingly important. 


Viewed a different way, we can see that trends towards 
very large files being the principal consumers of storage
have continued smoothly. As discussed in Section 4.5, 
this is a particular challenge because large files like 
VHDs have complex internal structures with difficult-to-predict
access patterns. Semantic knowledge to exploit
these structures, or file system interfaces that explicitly 
support them may be required to optimize for this class 
of data. 


4.4.2 File Times 

File modification time stamps are usually updated
when a file is written. Figure 15 shows a histogram and 
CDF of time since file modification with log scaling on 
the x-axis⁵. The same data with 1 month bins is plotted
in Figure 16. Most files were last modified between one
month and a year ago, but about 20% were modified
within the last month.





° Unlike the other combined histogram/CDF graphs, 
this one has both lines using the left y-axis due to a bug 
in the graphing package. 
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30% Figure 17 relates file modification time to the age of the 
20% a file system. The x-axis shows the time since a file was 
10% 2% last modified divided by the time since the file system 
0% 0% was formatted. This range exceeds 100% because 
0 12 24 36 48 some files were created prior to installation and were 
Time since last modification (months), one month subsequently copied to the file system, preserving their 
ping modification time. The spike around 100% mostly con- 
sists of files that were modified during the system in- 
mn CDF steers Histogram stallation. The area between 0% and 100% shows a 
Figure 16: Time Since Last File Modification relatively smooth decline, with a slight inflection 
around 40%. 
100% 5% 
90% 5% NTFS has always supported a last access time field for 
80% 4% files. We omit any analysis because updates to it are 
70% 4% disabled by default as of Windows Vista [18]. 
60% 3% 
50% an 4.5 Extensions 
ae ai Figure 18 shows only modest change in the extensions 
30% 2% for the most popular files. However, the extension 
20% 1% space continues to grow. The ten most popular files 
10% 1% extensions now account for less than 45% of the total 
0% 0% files compared with over 50% in 2000. 
0% 20% 40% 60% 80% 100% 120% 
Time Since File Modification/FS Age Figure 19 shows the top storage consumers by file ex- 
(%), 1% bins tension. Several changes are apparent here. First, there 
is a significant increase in storage consumed by files 
CDF +++*** Histogram with no extension, which have moved from 10" place 





in all previous years to be the largest class of files to- 

Figure 17: Time Since Last File Modification as a day, replacing DLLs. VHD and ISO files are virtual 
Fraction of File System Age. disks and images for optical media. They have in- 

creased in relative size, but not as quickly as LIB files. 

Finally, the portion of storage space consumed by the 
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top extensions has increased by nearly 15% from previ- 
ous years. 


5 On-disk Layout 


The behavior and characteristics of magnetic disks con- 
tinue to be a dominant concern in storage system opti- 
mization. It has been shown that file system perfor- 
mance changes over time, largely due to fragmentation 
[28]. While we have no doubt that the findings were 
true in 1997, our research suggests that this observation 
no longer holds in practice. 


We measure fragmentation in our data set by recording
the files’ retrieval pointers, which point to NTFS’s data
blocks. Retrieval pointers that are non-linear indicate a
fragmented file. We find such fragmentation to be rare,
occurring in only 4% of files. This lack of fragmentation
in Windows desktops is due to the fact that a large
fraction of files are not written after they are created
and due to the defragmenter, which runs weekly by
default*. However, among files containing at least one
fragment, fragments are relatively common. In fact,
25% of fragments are in files containing more than 170
fragments. The most highly fragmented files appear to
be log files, which (if managed naively) may create a
new fragment for each appending write.

* This is true for all of our scans other than the 17 that
came from machines running Windows Server 2003.

Figure 19: Bytes by File Extension
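
To make the fragmentation measurement in this section concrete, the sketch below counts fragments from a file's list of extents. The `Extent` tuple, the helper names, and the (virtual cluster, logical cluster, length) layout are assumptions chosen for illustration; this is not the scanner used in the study.

```python
from typing import List, NamedTuple


class Extent(NamedTuple):
    """One run of clusters; a hypothetical stand-in for a retrieval pointer."""
    vcn: int     # first virtual cluster number (offset within the file)
    lcn: int     # first logical cluster number (location on the volume)
    length: int  # number of contiguous clusters in this run


def count_fragments(extents: List[Extent]) -> int:
    """Count maximal physically contiguous runs in a file's extent list."""
    if not extents:
        return 0
    ordered = sorted(extents, key=lambda e: e.vcn)
    fragments = 1
    for prev, cur in zip(ordered, ordered[1:]):
        # A new fragment starts when the next run does not begin at the
        # cluster immediately following the previous run on disk.
        if cur.lcn != prev.lcn + prev.length:
            fragments += 1
    return fragments


def is_fragmented(extents: List[Extent]) -> bool:
    """A file is fragmented when its data is not one linear run."""
    return count_fragments(extents) > 1
```

Under this model, a log file that receives a new, non-adjacent extent on every append accumulates one fragment per write, which matches the behavior described for naively managed logs.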


6 Related Work 


Studies of live deployed system behavior and usage 
have long been a key component of storage systems 
research. Workload studies [30] are helpful in
determining what file systems do in a given slice of time, but
provide little guidance as to the long term contents of 
files or file systems. Prior file system content studies 
[1, 9] have considered collections of machines similar 
to those observed here. The most recent such study 
uses 7-year-old data, while data from the study before it
is 11 years old, which we believe justifies the file sys- 
tem portion of this work. However, this research also 
captures relevant results that the previous work does 
not. 


Policroniades and Pratt [24] studied duplication rates 
using various chunking strategies on a dataset about 
0.1% of the size of ours, finding little whole-file dupli- 
cation and a modest difference between fixed-block and 
content-based chunking. Kulkarni et al. [12] found
that combining compression, elimination of duplicate
identical-sized chunks, and delta-encoding across multiple
datasets was effective. Their corpus was about 8GB.


We are able to track file system fragmentation and data 
placement, which has not been analyzed recently [28] 
or at large scale. We are also able to track several 
forms of deduplication, which is an important area of 
current research. Prior work has used very selective 
data sets usually focusing either on frequent full back- 
ups [3, 8], virtual machine images [6, 13], or simulation 
[10]. In the backup case, data not modified between
backups can be trivially deduplicated, and in the virtual
machine case, disk images start from known identical
storage and diverge slowly over time. In terms of size,
only the DataDomain [33] study rivals ours. It is less
than half the size of the dataset presented here and was
for a highly self-selected group. Thus, we consider a
dataset that is not only more general but also larger than
those of comparable studies. Moreover, we include a
comparison to whole-file deduplication, which has been
missing in much of the deduplication research to date.
Whole-file deduplication is an obvious alternative to
block-based deduplication because it is lightweight and,
as we have shown, nearly as effective at reclaiming space.


7 Conclusion


We studied file system data, metadata, and layout on 
nearly one thousand Windows file systems in a com- 
mercial environment. This new dataset contains 
metadata records of interest to file system designers, 
data content findings that will help create space effi- 
ciency techniques, and data layout information useful in 
the evaluation and optimization of storage systems. 


We find that whole-file deduplication together with
sparseness is a highly efficient means of lowering
storage consumption, even in a backup scenario. It
approaches the effectiveness of conventional
deduplication at a much lower cost in performance and
complexity. The environment we studied, despite being
homogeneous, shows a large diversity in file system
and file sizes. This diversity, the increase in
unstructured files, and an ever-deepening and more
populated namespace pose significant challenges for future
file system designs. However, at least one problem,
that of file fragmentation, appears to be solved,
provided that a machine has periods of inactivity in which
defragmentation can be run.


Abstract


As data have been growing rapidly in data centers, 
deduplication storage systems continuously face chal- 
lenges in providing the corresponding throughputs and 
capacities necessary to move backup data within backup 
and recovery window times. One approach is to build a 
cluster deduplication storage system with multiple dedu- 
plication storage system nodes. The goal is to achieve 
scalable throughput and capacity using extremely high- 
throughput (e.g. 1.5 GB/s) nodes, with a minimal loss 
of compression ratio. The key technical issue is to route 
data intelligently at an appropriate granularity. 

We present a cluster-based deduplication system that 
can deduplicate with high throughput, support dedupli- 
cation ratios comparable to that of a single system, and 
maintain a low variation in the storage utilization of in- 
dividual nodes. In experiments with dozens of nodes, 
we examine tradeoffs between stateless data routing ap- 
proaches with low overhead and stateful approaches that 
have higher overhead but avoid imbalances that can 
adversely affect deduplication effectiveness for some 
datasets in large clusters. The stateless approach has 
been deployed in a two-node commercial system that 
achieves 3 GB/s for multi-stream deduplication through- 
put and currently scales to 5.6 PB of storage (assuming 
20X total compression). 


1 Introduction 


For business reasons and regulatory requirements [14,
29], data centers are required to back up and recover their
exponentially increasing amounts of data [15] to and
from backup storage within relatively small windows of
time, typically a small number of hours. Furthermore,
many copies of the data must be retained for potentially
long periods, from weeks to years. Typically, backup
software aggregates files into multi-gigabyte “tar” type
files for storage. To minimize the cost of storing the
many backup copies of data, these files have
traditionally been stored on tape.

Deduplication is a technique for effectively reducing 
the storage requirement of backup data, making disk- 
based backup feasible. Deduplication replaces identi- 
cal regions of data (files or pieces of files) with refer- 
ences (such as a SHA-1 hash) to data already stored on 
disk [6, 20, 27, 36]. Several commercial storage systems 
exist that use some form of deduplication in combina- 
tion with compression (such as Lempel-Ziv [37]) to store 
hundreds of terabytes up to petabytes of original (logical) 
data [8, 9, 16, 25]. One state-of-the-art single-node dedu- 
plication system achieves 1.5 GB/s in-line deduplication 
throughput while storing petabytes of backup data with 
a combined data reduction ratio in the range of 10X to 
30X [10]. 

To meet increasing requirements, our goal is a backup 
storage system large enough to handle multiple pri- 
mary storage systems. An attractive approach is to 
build a deduplication cluster storage system with indi- 
vidual high-throughput nodes. Such a system should 
achieve scalable throughput, scalable capacity, and a 
cluster-wide data reduction ratio close to that of a single 
very large deduplication system. Clustering storage
systems [5, 21, 30] is a well-known technique to increase
capacity, but adding deduplication nodes to such clusters
suffers from two problems. First, it will fail to achieve
high deduplication because such systems do not route
based on data content. Second, tightly-coupled cluster
file systems often do not exhibit linear performance
scalability because of requirements for metadata
synchronization or fine-granularity data sharing.

Specialized deduplication clusters lend themselves to 
a loosely-coupled architecture because consistent use 
of content-aware data routing can leverage the sophis- 
ticated single-node caching mechanisms and data lay- 
outs [36] to achieve scalable throughput and capac- 
ity while maximizing data reduction. However, there 
is a tension between deduplication effectiveness and
throughput. On one hand, as chunk size decreases,
deduplication rate increases, and single-node systems may
deduplicate chunks as small as 4-8 KB¹ to achieve very
high deduplication. On the other hand, with larger chunk 
sizes, high throughput is achieved because of stream 
and inter-file locality, and per-chunk memory overhead 
is minimized [18, 35]. High throughput deduplication 
with small chunk sizes is achieved on individual nodes 
using techniques that take advantage of cache locality to 
reduce I/O bottlenecks [20, 36]. For existing dedupli- 
cation clusters like HYDRAstor [8], though, relatively 
large chunk sizes (~64 KB) are used to maintain high 
throughput and fault tolerance at the cost of deduplica- 
tion. We would like to achieve scalable throughput and 
capacity with cluster-wide deduplication close to that of 
a state-of-the-art single node.

In this paper, we propose a deduplicating cluster that 
addresses these issues by intelligently “striping” large 
files across a cluster: we create super-chunks that rep- 
resent consecutive smaller chunks of data, route super- 
chunks to nodes, and then perform deduplication at each 
node. We define data routing as the assignment of super- 
chunks to nodes. By routing data at the granularity of 
super-chunks rather than individual chunks, we maintain 
cache locality, reduce system overheads by batch pro- 
cessing, and exploit the deduplication characteristics of 
smaller chunks at each node. The challenges with rout- 
ing at the super-chunk level are, first, the risk of creating 
duplicates, since the fingerprint index is maintained in- 
dependently on each node; and second, the need for scal- 
able performance, since the system can overload a single 
node by routing too much data to it. 

We present two techniques to solve the data routing 
problem in building an efficient deduplication cluster, 
and we evaluate them through trace-driven simulation 
of collected backups up to 50 TB. First, we describe a 
stateless technique that routes based on only 64 bytes 
from the super-chunk. It is remarkably effective on typi- 
cal backup datasets, usually with only a ~10% decrease 
in deduplication for small clusters compared to a single 
node; for balanced workloads the gap is within ~10-20%
even for clusters of 32-64 nodes. Second, we compare 
the stateless approach to a stateful technique that uses 
information about where previous chunks were routed. 
This achieves deduplication nearly as high as a single 
node and distributes data evenly among dozens of nodes, 
but it requires significant computation and either greater 
memory or communication overheads. We also explore 
a range of techniques for routing super-chunks that trade
off memory and communication requirements, including
varying how super-chunks are formed, how large they
are on average, how they are assigned to nodes, and how
node imbalance is addressed.

¹ Throughout the paper, references to chunks of a given
size refer to chunks that are expected to average that size.

The rest of this paper is organized as follows. Sec- 
tion 2 describes our system architecture, then Section 3 
focuses on alternatives for super-chunk creation and 
routing. Section 4 presents our experimental method- 
ology, datasets, and simulator, and Section 5 shows the 
corresponding results. We briefly describe our product 
in Section 6. We discuss related work in Section 7, and 
conclusions and future work are presented in Section 8. 


2 System Overview 


This section presents our deduplication cluster design. 
We first review the architecture of our earlier storage sys- 
tem [36], which we use as a single-node building block. 
Because the design of the single-node system empha- 
sizes high throughput, any cluster architecture must be 
designed to support scalable performance. We then show 
the design of the deduplication cluster with stateless rout- 
ing, corresponding to our product (differences pertaining 
to stateful routing are presented later in the paper). 

We use the following criteria to govern our design
decisions for the system architecture and the choice of a
routing strategy:


• Throughput: Our cluster should scale throughput
  with the number of nodes by maximizing parallel
  usage of high-throughput storage nodes. This implies
  that our architecture must optimize for cache
  locality, even with some penalty with respect to
  deduplication capacity: we will write duplicates
  across nodes for improved performance, within reason.

• Capacity: To maximize capacity, repeated patterns
  of data should be forwarded to storage nodes in
  a consistent fashion. Importantly, capacity usage
  should be balanced across nodes, because if a node
  fills up, the system must place new data on alternate
  nodes. Repeating the same data on multiple nodes
  leads to poor deduplication.


The architecture of our single-node deduplication system
is shown in Figure 1(a). We assume the incoming
data streams have been divided into chunks with a
content-based chunking algorithm [4, 22], and a finger- 
print has been computed to uniquely identify each chunk. 
The main task of the system is to quickly determine 
whether each incoming chunk is new to the system and 
then to efficiently store new chunks. High-throughput 
fingerprint lookup is achieved by exploiting the dedupli- 
cation locality of backup datasets: in the same backup 
stream, chunks following a duplicate chunk are likely to 
be duplicates, too. 

To preserve locality, we use a technique based on
Stream Informed Segment² Layout [36]: disk storage is
divided into fixed-size large pieces called containers, and
each stream has a dedicated container. The non-duplicate
fingerprints and chunk data are appended to the metadata
part and the data part of the container. The sequence of
fingerprints needed to reconstruct a file is also written as
chunks and stored to containers, and a root fingerprint
is maintained in a directory structure. When the current
container is full, it is flushed to disk, and a new container
is allocated for the stream.

² Note that the term “segment” in the earlier paper means
the same as the term “chunk” in this paper.

Figure 1: Deduplication node architecture and cluster
design using individual nodes as building blocks.
Panel (b): Dataflow of Deduplication Cluster.

To identify existing chunks, a fingerprint cache avoids
a substantial fraction of index lookups, and for those not
found in the cache, a Bloom filter [3] identifies with high
probability which fingerprints will be found in the
on-disk index. Thus disk accesses only occur either when a
duplicate chunk misses in our cache or when a full
container of new chunks is flushed to disk. (In rare cases, a
false positive from the Bloom filter will cause an
unnecessary lookup to the on-disk index.) Once a fingerprint is
loaded, many fingerprints that were written at the same
time are loaded with it, enabling subsequent duplicate
chunks to hit in the fingerprint cache.

Figure 1(b) demonstrates how to combine multiple
deduplication nodes into a cluster. Backup software
on each client collects individual files into a backup
stream, which it transfers to a backup server. We offer
a plugin [12] that runs on a customer’s backup
servers, which divides each stream into chunks,
fingerprints them, groups them into a super-chunk, and
routes each super-chunk to a deduplicating storage node.
Each storage node locally applies deduplication logic to
chunks while preserving data locality, which is essential
to maintain high throughput.
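
The single-node duplicate check described in this section (fingerprint cache first, then Bloom filter, then on-disk index, with container-level prefetch into the cache) can be sketched as follows. This is a minimal in-memory model under stated assumptions: `ToyBloom`, `DedupNode`, the container capacity of four fingerprints, and the use of a dictionary as the "on-disk" index are all hypothetical simplifications, not the product's implementation.

```python
import hashlib
from typing import Dict, Iterator, List, Set


class ToyBloom:
    """Small Bloom filter; the sizes are arbitrary choices for the example."""

    def __init__(self, bits: int = 1 << 16, hashes: int = 4) -> None:
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: bytes) -> Iterator[int]:
        for i in range(self.hashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class DedupNode:
    def __init__(self, container_capacity: int = 4) -> None:
        self.container_capacity = container_capacity
        self.cache: Set[bytes] = set()           # fingerprint cache
        self.bloom = ToyBloom()
        self.index: Dict[bytes, int] = {}        # models the on-disk index
        self.containers: List[List[bytes]] = [[]]
        self.index_lookups = 0                   # counts simulated disk lookups

    def is_duplicate(self, fp: bytes) -> bool:
        if fp in self.cache:
            return True                          # cache hit: no disk access
        if not self.bloom.might_contain(fp):
            return False                         # definitely new
        self.index_lookups += 1
        container_id = self.index.get(fp)
        if container_id is None:
            return False                         # Bloom filter false positive
        # Load every fingerprint written alongside this one, so that
        # neighbouring duplicates in the stream hit in the cache.
        self.cache.update(self.containers[container_id])
        return True

    def store(self, fp: bytes) -> bool:
        """Return True if the chunk was new and had to be stored."""
        if self.is_duplicate(fp):
            return False
        if len(self.containers[-1]) >= self.container_capacity:
            self.containers.append([])           # "flush" and open a new container
        self.containers[-1].append(fp)
        self.index[fp] = len(self.containers) - 1
        self.bloom.add(fp)
        return True
```

In this model, replaying a stream whose chunks were stored together costs one index lookup per container rather than one per chunk, which is the locality effect the text relies on.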


To clarify the parallelization that takes place in our 
cluster, consider writing a file to the cluster. When 
a super-chunk is routed to a storage node, deduplica- 
tion begins while the next super-chunk is created and 
routed to a potentially different node. All of the meta- 
data needed to reconstruct a file is stored in chunks and 
distributed across the nodes. When reading back a file, 
parallel reads are initiated to all of the nodes by looking 
ahead through the metadata references and issuing reads 
for super-chunks to the appropriate nodes. To achieve 
maximum parallelization, the I/O load should be equal 
on each node, and both read and write throughput should 
scale linearly with the number of nodes. 


Note that we do not yet specifically address the inter- 
node dependencies that arise in the event of a failure. 
Each node is highly redundant, with RAID and other data 
integrity mechanisms. It would be possible to provide re- 
dundant controllers in each node to eliminate that single 
point of failure, but these details are beyond the scope of 
this paper. 


Storage Rebalancing: When super-chunks are routed 
to a storage node, we use a level of indirection called 
a bin. We assign a super-chunk to a bin using the mod 
function, and then map each bin to a given node. By 
using many more bins (~1000) than actual nodes, the
Bin Manager (running on the master node) can rebalance 
nodes by reassigning bins in the future. The Bin Manager 
also handles expansion cases such as when a node’s stor- 
age expands or when a new node is added to the clus- 
ter. In those cases, the Bin Manager reassigns bins to 
the new storage to maintain balanced usage. Rebalanc- 
ing data takes place online while backups and other op- 
erations continue, and the entire process is transparent 
to the user. After a rebalance operation, the cluster will 
generally remain balanced for future backups. The mas- 
ter node communicates the bin-to-node mapping to the 
plugin. 
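
The bin indirection described in this paragraph can be sketched as a small lookup table. The class name `BinMap`, the choice of exactly 1024 bins (the text says only "~1000"), and the round-robin initial assignment are assumptions made for illustration.

```python
from typing import List

NUM_BINS = 1024  # assumption: the text only says "~1000" bins


class BinMap:
    """Hypothetical bin-to-node table of the kind a Bin Manager might keep."""

    def __init__(self, num_nodes: int, num_bins: int = NUM_BINS) -> None:
        self.num_bins = num_bins
        # Initial round-robin assignment of bins to nodes (an assumption).
        self.bin_to_node: List[int] = [b % num_nodes for b in range(num_bins)]

    def bin_for(self, feature: int) -> int:
        """Assign a super-chunk to a bin using the mod function."""
        return feature % self.num_bins

    def node_for(self, feature: int) -> int:
        """Resolve the bin to the node that currently owns it."""
        return self.bin_to_node[self.bin_for(feature)]

    def migrate_bin(self, bin_id: int, new_node: int) -> None:
        """Reassign one bin; routing for every other bin is unchanged."""
        self.bin_to_node[bin_id] = new_node
```

Because a super-chunk's bin never changes, moving a bin relocates all data that hashes to it together, so future backups of the same data continue to deduplicate against it on the new node.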


Bin migration occurs when the storage usage of a node 
exceeds the average usage in the cluster by some thresh- 
old (defaulting to 5%). Note that if there is a great deal of 
skew in the total physical storage of a single bin, that bin 
can exceed the threshold even if it is the only bin stored 
on a node. Such anomalous behavior is rare but possible,
and we discuss some examples of this in Section 5.


3 Data Routing


This section addresses two issues with data routing in 
our deduplication cluster: how to group chunks into 
super-chunks, and how to route data. Super-chunk for- 
mation is relatively straightforward and is discussed in 
Section 3.1. We focus here on two routing strategies: 
stateless routing, light-weight and well suited for most 
balanced workloads (Section 3.2); and stateful routing, 
requiring more overhead but maintaining a higher dedu- 
plication rate with larger clusters (Section 3.3). 


3.1 Super-Chunk Formation 


There are two important criteria for grouping consecu- 
tive chunks into super-chunks. First, we want an average 
super-chunk size that supports high throughput. Second, 
we want super-chunk selection to be resistant to small 
changes between full backups. 

The size of a super-chunk could vary from a single 
chunk to many megabytes, or it could be equal to indi- 
vidual files as suggested by Extreme Binning [2]. We 
experimented with a variety of average super-chunk sizes 
from 8 KB up to 4 MB on backup datasets. The average 
super-chunk size affects deduplication, balance across 
storage nodes, and throughput, and it is more thoroughly 
explored in Section 5.3. We generally found that a 1 MB 
average super-chunk size is a good choice, because it re- 
sults in efficient data locality on storage nodes as well as 
generally high deduplication, and this is the default value 
used in our experiments unless otherwise noted. 

Determining super-chunk boundaries (anchoring) mir- 
rors the problem of anchoring chunks [24] in many ways 
and should be implemented in a content-dependent fash- 
ion. Since all chunks in a super-chunk are routed to- 
gether, deduplication is affected by super-chunk bound- 
aries. We represent each chunk with a feature (see the 
next subsection), compare the feature against a mask, 
and when the mask is matched, the selected chunk be- 
comes the boundary between super-chunks. Minimum 
and maximum super-chunk sizes are enforced, half and 
double the desired super-chunk size respectively. 
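
A minimal sketch of this anchoring rule follows. The `(feature, size)` chunk representation, the function name `form_super_chunks`, and the default mask value are assumptions; in practice the mask would be chosen so that the expected spacing between matches gives the desired average super-chunk size.

```python
from typing import Iterable, Iterator, List, Tuple

Chunk = Tuple[int, int]  # (feature, size_in_bytes); hypothetical representation


def form_super_chunks(
    chunks: Iterable[Chunk],
    target_size: int = 1 << 20,  # 1 MB average, the default named in the text
    mask: int = 0x7F,            # assumption: tuned to the average chunk size
) -> Iterator[List[Chunk]]:
    """Group consecutive chunks, cutting at content-defined boundaries."""
    min_size = target_size // 2  # half the desired size
    max_size = target_size * 2   # double the desired size
    current: List[Chunk] = []
    current_size = 0
    for feature, size in chunks:
        current.append((feature, size))
        current_size += size
        at_anchor = (feature & mask) == mask
        # Cut at an anchor once the minimum is met, or force a cut once the
        # maximum is reached (the last chunk may overshoot it slightly).
        if current_size >= max_size or (at_anchor and current_size >= min_size):
            yield current
            current, current_size = [], 0
    if current:
        yield current  # trailing partial super-chunk
```

Because boundaries depend on chunk content rather than byte offsets, an insertion early in a backup stream shifts at most the super-chunks near the change instead of every super-chunk after it.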


3.2 Stateless Routing 


Numerous data routing techniques are possible: routing 
based only on the contents of the current super-chunk is 
stateless, while routing super-chunks using information
about the location of existing chunks is stateful (see Sec- 
tion 3.3). 

For stateless routing, the basic technique is to pro- 
duce a feature value representing the data and then ap- 
ply a simple function (such as mod #bins) to the value to 
make the assignment. As a super-chunk is a sequence of 
chunks, we first compute a feature from each chunk, and 
then select one of those features to represent the
super-chunk.

There are many options for generating a chunk feature.
A hash could be calculated over an entire chunk
(hash(*)) or over a prefix of the bytes near an anchor
point (hash(N), for a prefix of N bytes). Using the hash
of a representative portion of a chunk results in data that
are similar, but not identical, being routed to the same
node; the net effect is to improve deduplication while
increasing skew. We tried a range of prefix lengths and
found the best results when using the first 64 bytes after
a chunk anchor point (i.e., hash(64)), which we compare
to hash(*). When using a hash for routing rather
than deduplication, collisions are acceptable, so we use
the first 32-bit word of the SHA-1 hash for hash(64).

In addition, we considered other variants, such as 
fingerprints computed over sliding windows of con- 
tent [22]; these did not make a substantial difference in 
the outcome, and we do not discuss them further. 

To select a super-chunk feature based on the chunk 
features, the first, maximum, minimum, or most common 
chunk feature could be selected; using just the first has 
the advantage that it is not necessary to buffer an entire 
super-chunk before deciding where to route it, something 
that matters when hundreds or thousands of streams are 
being processed simultaneously. Another stateless tech- 
nique is to treat the feature of each chunk as a “vote” 
for a node and select the most common, which does not 
work especially well, because hash values are often uni- 
formly distributed. We experimented with a variety of 
options and found the most interesting results with four 
combinations: hash(64) of the first chunk, the mini- 
mum hash(64) across a super-chunk, hash(*) of the 
first chunk, and the minimum hash(*) across a super- 
chunk (compared in detail in Section 5.2). Elsewhere, 
hash(64) refers to the feature from the first chunk un- 
less stated otherwise. 
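
The stateless variants discussed in this subsection can be sketched as below. The function names are hypothetical, each chunk is assumed to begin at its anchor point (so the prefix is simply the first 64 bytes), and the big-endian interpretation of the first 32-bit word is an assumption, since the text does not specify byte order.

```python
import hashlib
from typing import Sequence


def hash_n(chunk: bytes, n: int = 64) -> int:
    """hash(N): first 32-bit word of SHA-1 over the first n bytes of a chunk."""
    digest = hashlib.sha1(chunk[:n]).digest()
    return int.from_bytes(digest[:4], "big")


def route_stateless(
    super_chunk: Sequence[bytes],
    num_bins: int,
    use_min: bool = False,
) -> int:
    """Pick a bin from chunk features alone, with no knowledge of node state."""
    if use_min:
        # Minimum feature: requires buffering the whole super-chunk.
        feature = min(hash_n(c) for c in super_chunk)
    else:
        # First-chunk feature: can be decided as soon as the first chunk arrives.
        feature = hash_n(super_chunk[0])
    return feature % num_bins
```

Two super-chunks whose first chunks share a 64-byte prefix are routed to the same bin even if the rest of their content differs, which is how similar but not identical data ends up co-located.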

The main advantages of stateless techniques are (1) 
reduced overhead for recording node assignments, and 
(2) reduced requirements for recovering this state after 
a system failure. Stateless routing has some properties 
of a “shared nothing” [31] architecture because of lim- 
ited shared state. There is a potential for a loss of dedu- 
plication compared to the single-node case, and there is 
also the potential for increased data skew if the selected 
features are not uniformly distributed. We find empiri- 
cally that the reduction in deduplication effectiveness is 
within acceptable bounds, and bin migration can usually 
address excessive data skew. 


3.3 Stateful Routing


Using information about the location of existing chunks 
can improve deduplication, at an increased cost in (a) 
computation and (b) memory or communication. We 
present a stateful approach that produces deduplication 
that is frequently comparable to that of a single node
even with a significant number of nodes (32-64); also,
by balancing the benefit of matching existing chunks 
against the capacity of overloaded nodes, it avoids the 
need to migrate data after the fact. This approach is not a 
panacea, however, as it increases memory requirements 
(per-node Bloom filters, if storing them on a master node, 
and buffering an entire super-chunk before routing it) and 
computational overhead, as discussed below. 

To summarize our stateful routing algorithm, in its 
simplest form: 


1. Use a Bloom filter to count the number of times 
each fingerprint in a super-chunk is already stored 
on a given node. 


2. Weight the number of matches (“votes”) by each 
node’s relative storage utilization. Overweight 
nodes are excluded. 


3. If the highest weighted vote is above a threshold, 
select that node. 


4. If no node has sufficient weighted votes, route to the
node selected via hash(64) of the first chunk if it is
not overloaded; otherwise route to the least loaded
node.
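
The four steps above can be sketched as a single routing function. This is an illustrative model only: Python sets stand in for the per-node Bloom filters, the parameter names are hypothetical, and the defaults (a benefit threshold of 1.5 and exclusion of nodes more than 5% above average usage) are taken from values mentioned elsewhere in this section.

```python
from typing import List, Sequence, Set


def route_stateful(
    fingerprints: Sequence[bytes],
    node_contents: List[Set[bytes]],  # stand-in for per-node Bloom filters
    node_usage: List[float],          # physical bytes stored on each node
    fallback_node: int,               # node chosen by hash(64) of the first chunk
    benefit_threshold: float = 1.5,
    capacity_threshold: float = 1.05,
) -> int:
    num_nodes = len(node_contents)
    average = (sum(node_usage) / num_nodes) or 1.0  # avoid dividing by zero
    relative = [u / average for u in node_usage]
    overloaded = [r > capacity_threshold for r in relative]

    # Step 1: count how many fingerprints each node already stores.
    votes = [sum(fp in stored for fp in fingerprints) for stored in node_contents]

    # Step 2: weight votes by relative utilization; exclude overloaded nodes.
    # Nodes below the average are treated as if they were at the average.
    weighted = [
        0.0 if overloaded[n] else votes[n] / max(relative[n], 1.0)
        for n in range(num_nodes)
    ]

    # Step 3: take the best node if it clears the voting threshold.
    best = max(range(num_nodes), key=lambda n: weighted[n])
    if weighted[best] >= benefit_threshold * len(fingerprints) / num_nodes:
        return best

    # Step 4: fall back to the stateless choice, or to the least loaded node.
    if not overloaded[fallback_node]:
        return fallback_node
    return min(range(num_nodes), key=lambda n: node_usage[n])
```

Replacing the sets with real Bloom filters would add false-positive matches, which can only inflate vote counts; the control flow would stay the same.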


We now explain the algorithm in more detail. To route 
a super-chunk, once the master node knows the number
of chunks in common with (a.k.a. “matching”) each
node, it selects a destination. However, such a “voting” 
approach requires care to avoid problematic cases: sim- 
ply targeting the node with the most matching chunks 
will route more and more super-chunks there, because 
the more data it has relative to other nodes, the more 
likely it is to match the most chunks. 

Thus, one refinement to this stateful approach is to cre- 
ate a threshold for a minimum fraction of chunks that 
must match a node before it is selected. With a uniform 
distribution, one expects each node to match at most
$C/N$ chunks on average, where $C$ is the number of chunks in
the super-chunk and $N$ is the number of nodes. Typically
not all chunks will match any node, and the average num- 
ber of matches will be lower, but if a node already stores 
significantly more than the expected average, this is a 
reason to route the super-chunk to that node. In our sys- 
tem, a voting benefit threshold of 1.5 means that a node 
is considered as a candidate only if it already matches 
at least $1.5 \cdot C/N$ chunks. This prevents a node from being
selected simply because it matches more than any other 
node, when no node matches well enough to be of inter- 
est. 

Simply using a static threshold for the number of 
matches to vote a super-chunk to a particular node still 
results in high data skew, as popular nodes get more 


popular over time. A technique we call weighted vot- 
ing addresses that deficiency by striking a balance be- 
tween deduplication and uniform storage utilization. It 
decreases the perceived value of known duplicates in 
proportion to the extent to which a node is overloaded 
relative to the average storage utilization of the system. 
As an example, if a node matches $2C/N$ chunks in a
super-chunk, but that node stores 120% ($6/5$) of the average
node, then the node is treated as though it matched $\frac{5}{6} \cdot 2C/N$
chunks. Note that while a node that stores less than 
the average could be given a weight < 1, increasing the 
overall weighted value, instead we assign such nodes a 
weight of 1. This ensures that when multiple nodes can 
easily accommodate the new super-chunk, the node is 
selected based on the best match. We experimented with 
various weight functions, but we found that it is effective 
simply to exclude nodes that are above a capacity
threshold. In practice, a capacity of 5% above the average was
selected as the threshold (see Section 5.4).

The computational cost arises because the stateful ap- 
proach computes where every chunk in a super-chunk 
is currently stored. A Bloom filter lookup has to be 
performed for each chunk, on each node in the cluster, 
before a routing destination can be picked. Each such 
lookup is extremely fast (~100-200 ns), but there can
be a great many of these lookups: inserting $M$ chunks
into an $N$-node cluster would result in $NM$ Bloom filter
lookups, compared to M lookups in a single-node sys- 
tem. The additional overhead in memory or communi- 
cation depends on whether the master node(s) tracks the 
state of each storage node (resulting in substantial mem- 
ory allocations) or sends the chunk fingerprints to the 
storage nodes and collects counts of how many chunks 
match each node (resulting in communication overhead). 
One way to mitigate the effect is to sample [20] chunks 
that are used for voting. We reduce the number of chunks 
considered by checking each chunk’s fingerprint for a 
bit pattern of a specific length (e.g., $B$ bits must match
a pattern for a $1/2^B$ sampling rate); the total number of
lookups is then approximately $NM/2^B$. Without
sampling, the total cost of the Bloom filter lookups is about
1.2 hours of computation for a 5-TB dataset, but a sam- 
pling rate of 1/8 cuts this to 13 minutes of overhead with 
a nominal reduction in deduplication (see Section 5.5). 
That work can further be parallelized across back-ends 
or in threads on the front-end. 
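
The sampling rule can be sketched in a few lines. This is a minimal illustration, not the implementation evaluated here: the function names, the use of SHA-1 over counters as stand-in fingerprints, and the choice of the low-order bits with an all-zero pattern are all assumptions made for the example.

```python
import hashlib


def is_sampled(fingerprint: bytes, bits: int, pattern: int = 0) -> bool:
    """Return True when the low-order `bits` bits of the fingerprint equal `pattern`.

    With B bits, roughly 1 in 2**B uniformly distributed fingerprints is selected.
    """
    mask = (1 << bits) - 1
    return (int.from_bytes(fingerprint, "big") & mask) == pattern


def approx_lookups(n_nodes: int, m_chunks: int, bits: int) -> float:
    """Approximate Bloom filter lookups for M chunks on N nodes: N * M / 2**B."""
    return n_nodes * m_chunks / (1 << bits)


# Stand-in fingerprints (assumption); real ones are hashes of chunk contents.
fingerprints = [hashlib.sha1(str(i).encode()).digest() for i in range(50_000)]
sampled = [fp for fp in fingerprints if is_sampled(fp, bits=3)]
print(len(sampled) / len(fingerprints))          # close to 1/8
print(approx_lookups(32, len(fingerprints), 3))  # 200000.0
```

Because the decision depends only on the fingerprint, the same chunk is either always sampled or never sampled, so repeated data votes consistently.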

As an example, the general approach to weighted voting
is depicted in Figure 2. In this example, the seven
numbered chunks in this super-chunk are sampled for
voting. Chunks 1, 3, and 4 are contained on node 1,
chunks 2, 3, 5, and 6 are on node 2, chunk 5 is also on
node 4, and chunk 7 is not stored on any node. Node
1 has 3 raw votes, and node 2 has 4. Factoring in
space, since node 2 uses much more than the average,
its weighted votes are (4/1.35) = 2.96. Node 1 has a
slightly higher weighted vote of 3. The minimum weight
for a node to be selected is 1.5 × 7/4 = 2.6. Thus node 1 is
selected for routing.


[Figure 2 values. Relative physical storage usage of
nodes 1-4: 0.83, 1.35, 0.79, 1.03. Weighted votes:
node 1, 3/1.00 = 3.00; node 2, 4/1.35 = 2.96; node 3, 0;
node 4, 1/1.03 = 0.97. Chunk 7 has no match.]

Figure 2: Weighted voting example. A node with many
matches will be selected if it does not also have too much
data already, relative to the other nodes. Any node with a
relative storage usage of less than 1 is treated as though
it is at the average.
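
The Figure 2 numbers can be reproduced with a short sketch. The function names and the fallback argument are hypothetical, and the minimum weight is computed under the assumption that it equals the vote threshold (1.5) times the average number of sampled chunks per node (7 chunks over 4 nodes); this is one reading of the example, not a confirmed description of the product.

```python
def weighted_votes(raw_votes, relative_usage):
    """Divide each node's raw votes by its relative storage usage.

    Nodes below the average are treated as if at the average (weight 1).
    """
    return [v / max(1.0, u) for v, u in zip(raw_votes, relative_usage)]


def choose_node(raw_votes, relative_usage, sampled_chunks,
                vote_threshold=1.5, fallback=None):
    """Return (node index or fallback, weighted votes, minimum weight)."""
    weighted = weighted_votes(raw_votes, relative_usage)
    # Assumed rule: threshold times the average sampled chunks per node.
    minimum = vote_threshold * sampled_chunks / len(raw_votes)
    best = max(range(len(weighted)), key=lambda i: weighted[i])
    if weighted[best] >= minimum:
        return best, weighted, minimum
    return fallback, weighted, minimum


raw = [3, 4, 0, 1]                  # raw votes for nodes 1-4
usage = [0.83, 1.35, 0.79, 1.03]    # relative physical storage usage
node, weighted, minimum = choose_node(raw, usage, sampled_chunks=7)
print(node + 1, [round(w, 2) for w in weighted], minimum)
```

When no node reaches the minimum, the sketch returns the fallback, which would stand for the node chosen by the first chunk's feature.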

The main advantage of a stateful technique is the op- 
portunity to incorporate expected deduplication and ca- 
pacity balancing while assigning chunks to nodes. On 
the other hand, computational or communication over- 
head must be considered when choosing this technique, 
though it is an attractive option for coping with unbal- 
anced workloads or cluster sizes beyond our current ex- 
pectations. 


4 Experimental Methodology 


We use trace-driven simulation to evaluate the tradeoffs 
of the various techniques described in the previous sec- 
tion. This section describes the datasets used, the evalu- 
ation metrics, and the details of the simulator. 


4.1 Datasets 


In this paper, we simulate super-chunk routing for nine 
datasets. Three were collected from large backup envi- 
ronments representing typical scenarios where a backup 
server hosts multiple data types from dozens of clients. 
These datasets contain approximately 40-50 TB precom- 
pressed data. To analyze how our routing technique han- 
dles datasets with specific properties, we also analyze 
five datasets representing single data types. Four of the 
datasets are each approximately 5 TB and a fifth is about 
13 TB. In addition, we synthesize a “blended” dataset 
consisting of a mixture of the five smaller datasets. In 
general, we use them in the form that a deduplication 
appliance would see them: tar files that are usually 
many gigabytes in size, rather than individual small files. 
With the exception of the “blended” dataset, all of these 
datasets represent real backups from production environ-
ments.

Name         | Size (GB) | Daily (GB) | Dedup. | Months
Collection 1 |    40,695 |      2,867 |    6.1 | 1-2
Collection 2 |    44,536 |      1,536 |   11.5 | 4-6
Collection 3 |    51,584 |      2,150 |    6.1 | 3
Perforce     |     4,574 |        250 |   20.8 | 6
Workstations |     4,926 |        200 |    5.6 | 6
Exchange     |     5,253 |         33 |    6.8 | 7
System Logs  |     5,436 |        122 |   38.7 | 4
Home Dirs.   |    12,907 |        855 |   19.3 | 3
Blended      |    33,097 |        N/A |   12.5 | N/A

Table 1: Summary of datasets. The Collection datasets 
were collected from backup servers with multiple data 
types, and the other datasets were collected from sin- 
gle data-type environments. Deduplication ratios are ob- 
tained from a single-node system. 


For the three collected datasets, we received permis- 
sion to analyze production backup servers within EMC. 
We gathered traces for each file including the timestamp, 
sequence of chunk fingerprints, and other metadata necessary
to analyze chunk routing. In an earlier collection
from internal backup servers, we gathered copies of backup
files for the individual data types.

Table 1 lists salient information about these datasets: the
total logical size, the daily peak size, the single-node 
deduplication rate, and the number of months in the 99th 
percentile of retention period. The datasets are: 
Collection 1: Backups from approximately 100 clients 
consisting of half software development and half busi- 
ness records. Backups are retained 1-2 months. 
Collection 2: Backups from approximately 50 engineer- 
ing workstations with 4 months of retention and servers 
with 6 months of retention. 

Collection 3: Backups of over 100 clients for Exchange, 
SQL servers, and Windows workstations with 3 months 
of retention. 

Perforce: Backups from a version control repository. 
Workstations: Backups from 16 workstations used for 
build and test. 

Exchange: Backups from a Microsoft Exchange server. 
Each day contains a single full backup. 

System Logs: Backups from a server’s /var directory, 
containing numerous system files and logs. Full backups 
were created weekly. 

Home Directory: Backups from engineers’ home di- 
rectories, containing source code, office documents, etc. 
Full backups were created weekly. 

Blended: To explore the effects of multiple datasets be- 
ing written to a storage system (a common scenario), 
we created a blended dataset. We combined alternating 
super-chunks of the single data-type datasets, weighted
by overall size; thus there are approximately two super-
chunks from the “home directory” dataset for each super-
chunk of the others. The overall deduplication for this 
dataset (12.5) is somewhat higher than the weighted aver- 
age across the datasets (12.3), due to some cross-dataset 
commonality. 
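
The weighted average of 12.3 can be checked against Table 1. The sketch below assumes that "weighted average" means the total logical size divided by the total post-deduplication size of the five single data-type datasets; the variable and function names are illustrative.

```python
# (logical size in GB, single-node deduplication ratio) from Table 1
datasets = {
    "Perforce": (4574, 20.8),
    "Workstations": (4926, 5.6),
    "Exchange": (5253, 6.8),
    "System Logs": (5436, 38.7),
    "Home Dirs.": (12907, 19.3),
}


def combined_dedup(data):
    """Total logical size over total physical size when nothing is shared."""
    logical = sum(size for size, _ in data.values())
    physical = sum(size / ratio for size, ratio in data.values())
    return logical / physical


print(round(combined_dedup(datasets), 1))  # about 12.3
```

Under this reading, the measured 12.5 for the blended dataset exceeds the no-sharing value, which is consistent with a small amount of cross-dataset commonality.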

While our experiments studied all of these datasets, 
because of space limitations, we typically only present 
results for two: Workstations and Exchange. Experiments
with Workstations have results consistent with
the other datasets and represent our expected customer
experience. The Exchange dataset showed consistently
worse results with our techniques and is presented for 
comparison. Because of data patterns within Exchange, 
using a 1-MB super-chunk results in overloading a single 
bin with 1/16 of the data. 


4.2 Evaluation Metrics


The principal evaluation metrics are: 

Total Deduplication (TD): The ratio of the original 
dataset size to the size after identical chunks are elim- 
inated. (We do not consider local compression (e.g., 
Lempel-Ziv [37]), which is orthogonal to the issues con- 
sidered in this paper.) 

Data Skew: The ratio of the largest node’s physical
(post-deduplication) storage usage to the average usage,
used to evaluate how far a particular configuration is
from perfect balance. High skew leads to a node filling
up and duplicate data being written to alternative nodes,
as discussed in Section 2.

Effective Deduplication (ED): Total Deduplication di- 
vided by Data Skew, as a single utility measure that en- 
compasses both deduplication effectiveness and storage 
imbalance. ED is equivalent to Total Deduplication com- 
puted as if every node consumes the same amount of 
physical storage as the most loaded node. This metric 
is meaningful because the whole cluster degrades when 
one node is filled up. ED permits us to compare routing 
techniques and parameter options with a single value. 
Normalized ED: ED divided by deduplication achieved 
by a single-node system. This is an indication of how 
close a super-chunk routing method is to the ideal dedu- 
plication achievable on a cluster system. It allows us 
to compare the effectiveness of chunk-routing methods 
across different datasets under the same [0, 1] scale. 
Fingerprint Index Lookups: Number of on-disk index 
lookups, used as an approximation to throughput. The 
lookup rate is the number of lookups divided by the num- 
ber of chunks processed by a storage node. 
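
The capacity metrics above reduce to a few ratios. The following sketch uses hypothetical function names and made-up per-node sizes purely to show how TD, skew, ED, and normalized ED relate; it is not taken from the simulator.

```python
import math


def total_dedup(logical_size, physical_per_node):
    """TD: original dataset size over the size after duplicate elimination."""
    return logical_size / sum(physical_per_node)


def data_skew(physical_per_node):
    """Skew: largest node's physical usage over the average usage."""
    mean = sum(physical_per_node) / len(physical_per_node)
    return max(physical_per_node) / mean


def effective_dedup(logical_size, physical_per_node):
    """ED: TD divided by skew."""
    return total_dedup(logical_size, physical_per_node) / data_skew(physical_per_node)


def normalized_ed(ed, single_node_dedup):
    """Normalized ED: ED relative to single-node deduplication."""
    return ed / single_node_dedup


physical = [100, 100, 100, 200]   # illustrative per-node physical usage
logical = 5000                    # illustrative logical size
ed = effective_dedup(logical, physical)
# ED equals TD computed as if every node were as full as the largest one.
assert math.isclose(ed, logical / (len(physical) * max(physical)))
print(total_dedup(logical, physical), data_skew(physical), ed)
```

As a check against reported values, a normalized TD of 0.65 with a skew of 4.12 gives a normalized ED of about 0.16, matching the Exchange row for 64 nodes in Table 2.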


4.3 Simulator 


Most of the results presented in this paper come from a 
set of simulations, organized as follows: 
1. For the Collection datasets, we read from a deduplicating
storage node and reconstructed files based on
metadata to create a full trace including the chunk size, 
its hash(*) value, and its hash(64) value. The other 
datasets were preprocessed by reading in each file, com- 
puting chunks of a particular average size (typically 
8 KB), and storing a trace. 

2. The per-chunk data are passed into a program to 
determine super-chunk boundaries and route those super- 
chunks to particular nodes. It produces statistical infor- 
mation about deduplication rates, data skew, the number 
of Bloom filter lookups performed, and so on. In addition,
it logs the SHA-1 hash and location of each super-chunk,
on a per-node basis. Its parameters include the
super-chunk routing algorithm; the average super-chunk 
size (typically 1 MB); the maximum relative node size 
before bin migration is performed (for stateless) or node 
assignment is avoided (for stateful), defaulting to 1.05; 
some stateful routing parameters described below, and 
several others not considered here. 

The simulator was validated in part by comparing 
deduplication results for Total Deduplication and skew 
to the values reported by the live two-node system. Due 
to minor implementation differences, normalized TD is 
typically up to 2-3% higher in the simulator than in the
live system, though in one case the real system reported 
slightly higher normalized deduplication. Skew is simi- 
larly close. 

The stateful routing parameters are: (a) Vote sampling: 
what fraction of chunks, on average, should be passed 
to the Bloom filters and checked for matches? (Default: 
1/8.) (b) Vote threshold: how many more matches than
the average (as a fraction) should an average-sized node
have before it is used instead of the node selected by the
first chunk? (Default: 1.5.)

3. To analyze caching effects on a storage system, 
each of the node-specific super-chunk files can be used 
to synthesize a data stream with the same deduplication 
patterns and chunk sizes, which speeds up experimenta- 
tion relative to reading the original data repeatedly. For 
simplicity, the compression for the synthesized chunks 
was fixed at 2:1, a close approximation to overall com- 
pression for the datasets used. This stream is then written 
to a deduplication appliance, sending each bin to its final 
node in the original simulations after migration. 

The accuracy of using a synthesized stream in place 
of the original dataset was validated by comparing Total 
Deduplication of several synthesized results to those of 
original datasets. 


5 Experimental Results 


We focused our experiments on analyzing the impact of 
super-chunk routing on capacity and fingerprint index 
lookups across a range of cluster sizes and a variety of 
datasets. We start by surveying how different routing
approaches fare over a broad range of datasets and cluster
sizes (Section 5.1). This gives a picture of how Total
Deduplication and skew combine into the Effective
Deduplication metric. Then we dive into specifics:


• What is the best feature (hash(64) vs. hash(*),
routing by first chunk vs. all chunks in a super-
chunk) for routing super-chunks (Section 5.2)?


• How does super-chunk size affect fingerprint cache
lookups and locality (Section 5.3)?


• How sensitive is the system to various parameter
settings, including capacity threshold (Section 5.4)
and those involved in stateful routing (Section 5.5)?


Figure 3: Normalized ED of the stateless and stateful techniques as a function of the number of nodes. The top row
represents the collected “real-world” datasets. Stateful and hash(64) (mig) use a capacity threshold of 5%.


5.1 Overall Effectiveness 


We first compare the basic techniques, stateless and 
stateful, across a range of datasets. Figure 3 shows a 
scatter plot for the nine datasets and three algorithms: 
hash(64) without bin migration, hash(64) with a 5% 
migration threshold, and stateful routing with a 5% ca- 
pacity limitation. 

In general, hash(64) without migration works well 
for small clusters (2-4 nodes) but degrades steadily as 
the cluster size increases. Adding bin migration greatly 
improves the ED for most of the datasets, though even 
with bin migration, ED for Exchange decreases rapidly 
as the number of nodes increases, and there is also a
sharp decrease for Home Directories and Blended at
64 nodes. This skew occurs when a single bin is substan- 
tially larger than the average node utilization (see Sec- 
tion 5.4). Stateful routing is often within 10% of the 
single-node deduplication even at 64 nodes, although for 
some datasets the gap is closer to 20%. However, there 
is additional overhead, as discussed in Section 5.5. 


Table 2 presents normalized Total Deduplication (TD), 
data skew, and Effective Deduplication (ED) for several 
datasets, as the number of nodes varies (corresponding to 
the hash(64) (mig) and stateful curves in Figure 3). It 
shows how a moderate increase in skew results in a mod- 
erate reduction in ED (Workstations), but Exchange 
suffers from both repeated data (losing 1/3 of TD) and sig-
nificant skew (further reducing ED by a factor of 4).


              |       hash(64)       |       stateful
# nodes       |  TD    Skew    ED    |  TD    Skew    ED
Collection 1
 1            | 1.00   1.00   1.00   | 1.00   1.00   1.00
 2            | 0.93   1.02   0.91   | 0.95   1.00   0.95
 4            | 0.89   1.03   0.86   | 0.95   1.00   0.95
 8            | 0.86   1.03   0.84   | 0.95   1.01   0.94
16            | 0.85   1.04   0.81   | 0.94   1.04   0.91
32            | 0.83   1.04   0.80   | 0.94   1.04   0.91
64            | 0.83   1.07   0.77   | 0.94   1.05   0.90
Collection 2
 1            | 1.00   1.00   1.00   | 1.00   1.00   1.00
 2            | 0.97   1.00   0.97   | 0.98   1.00   0.98
 4            | 0.94   1.02   0.92   | 0.97   1.00   0.97
 8            | 0.92   1.04   0.88   | 0.97   1.00   0.97
16            | 0.90   1.04   0.86   | 0.97   1.00   0.97
32            | 0.88   1.04   0.85   | 0.96   1.00   0.96
64            | 0.87   1.04   0.84   | 0.96   1.00   0.96
Collection 3
 1            | 1.00   1.00   1.00   | 1.00   1.00   1.00
 2            | 0.92   1.01   0.92   | 0.95   1.01   0.94
 4            | 0.88   1.05   0.84   | 0.95   1.03   0.93
 8            | 0.85   1.04   0.82   | 0.96   1.04   0.92
16            | 0.84   1.05   0.80   | 0.96   1.05   0.91
32            | 0.83   1.03   0.80   | 0.95   1.05   0.91
64            | 0.82   1.07   0.77   | 0.95   1.05   0.91
Workstations
 1            | 1.00   1.00   1.00   | 1.00   1.00   1.00
 2            | 0.97   1.02   0.95   | 0.98   1.00   0.98
 4            | 0.95   1.02   0.93   | 0.98   1.01   0.97
 8            | 0.94   1.04   0.90   | 0.98   1.04   0.94
16            | 0.92   1.05   0.88   | 0.98   1.04   0.94
32            | 0.91   1.04   0.88   | 0.97   1.03   0.94
64            | 0.91   1.05   0.86   | 0.97   1.04   0.93
Exchange
 1            | 1.00   1.00   1.00   | 1.00   1.00   1.00
 2            | 0.86   1.01   0.86   | 0.89   1.00   0.89
 4            | 0.78   1.01   0.77   | 0.87   1.02   0.85
 8            | 0.72   1.04   0.69   | 0.87   1.02   0.85
16            | 0.68   1.08   0.63   | 0.87   1.01   0.86
32            | 0.67   2.09   0.32   | 0.87   1.05   0.83
64            | 0.65   4.12   0.16   | 0.87   1.04   0.83

Table 2: Total Deduplication (TD), data skew, and nor-
malized Effective Deduplication ratio (ED = TD / Skew) for
some of the datasets, using capacity thresholds of 5%.


5.2 Feature Selection


As discussed in Section 3.2, there are a number of ways
to route a super-chunk. Here we compare four super-
chunk features: hash(64) of the first chunk, the mini-
mum of all hash(64), the hash(*) of the first chunk, or
the minimum of all hash(*). We also compare against
the method used by HYDRAstor [8], which consists of
64-KB chunks routed based on their fingerprint. Figure 4
shows the normalized ED of these four features for two
datasets, not factoring in any capacity limitations. For
Workstations, all four choices are similarly effective,
which is consistent with the other datasets that are not
shown. Exchange demonstrates the extreme case, in
which most chunk-routing features degrade badly with
large clusters. One can see the effect of high skew when
a common feature results in distinct chunks being routed
to the same node. This is less common when the entire
chunk’s hash is used than when a prefix is used: first
hash(*) spreads out the data more, resulting in less data
skew and better ED. Even though chunks are consistently
routed with the HYDRAstor technique (HYDRAstor), the
ED is generally worse than the other techniques because
of the larger chunk size: the deduplication is less than
half that achieved with 8-KB chunks on a single node.

The figure demonstrates that first hash(64) is 
generally somewhat better for smaller clusters, while 
first hash(*) is better for larger ones. (This effect 
arises because first hash(64) is more likely to keep 
putting even somewhat similar chunks on the same node, 
which improves deduplication but increases skew.) Us- 
ing the minimum of either feature, as Extreme Binning 
does for hash(*), generally achieves similar dedupli- 
cation to using the first chunk. Due to its effectiveness 
with the cluster sizes being deployed in the near future 
and its reduction in buffer requirements, we use first 
hash(64) as the default and refer to it as hash(64) for
simplicity elsewhere. 
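
The four features compared in this section can be illustrated with a small sketch. It assumes that hash(64) is a hash over the first 64 bytes of a chunk and hash(*) a hash over the whole chunk, uses SHA-1 as a stand-in hash, and maps a feature to a bin by a simple modulus; the function names and the bin count are hypothetical.

```python
import hashlib


def hash_64(chunk: bytes) -> int:
    """Assumed reading of hash(64): hash of the first 64 bytes of the chunk."""
    return int.from_bytes(hashlib.sha1(chunk[:64]).digest(), "big")


def hash_all(chunk: bytes) -> int:
    """Assumed reading of hash(*): hash of the entire chunk."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def feature(super_chunk, chunk_hash, use_minimum=False):
    """Feature of a super-chunk: hash of the first chunk, or minimum over all."""
    if use_minimum:
        return min(chunk_hash(c) for c in super_chunk)
    return chunk_hash(super_chunk[0])


def route(super_chunk, n_bins, chunk_hash=hash_64, use_minimum=False):
    """Map the super-chunk's feature to one of n_bins bins."""
    return feature(super_chunk, chunk_hash, use_minimum) % n_bins


prefix = b"H" * 64                       # shared 64-byte header
a = [prefix + b"payload-1", b"x" * 100]
b = [prefix + b"payload-2", b"y" * 100]
print(route(a, 1024) == route(b, 1024))  # True: same first-chunk prefix
print(hash_all(a[0]) == hash_all(b[0]))  # False: whole chunks differ
```

The example shows why a prefix feature keeps similar chunks together: two distinct chunks with the same first 64 bytes share a bin under first hash(64), which can help deduplication but also concentrates data.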


5.3 Factors Impacting Cluster Throughput


A major goal of our architecture is to maximize through- 
put as the cluster scales, and in a deduplicating sys- 
tem, the main throughput bottleneck is fingerprint index 
lookups that require a random disk read [36]. We are not 
able to produce a throughput measure in MB/s through 
simulation, so we use fingerprint index lookups as an in- 
direct measure of throughput. 

There are two important issues involving fingerprint 
index lookups to consider. The first is the total number 
of fingerprint index lookups that take place, since this is 
a measure of the amount of work required to process a 
dataset and is impacted by data skew. The second is the 
rate of fingerprint index lookup, which indicates the lo- 
cality of data written to disk. These values are impacted 
both by the super-chunk size and number of nodes in a 
cluster, and we have selected a relatively large cluster 
size (32 nodes) while varying the super-chunk size. 

Early generations of backups (the first few weeks of a 
dataset) tend to be laid out sequentially because of a low 
deduplication rate, while higher generations of backups 
are more scattered. To highlight this impact, we ana- 
lyzed the caching effects while writing the final 1 TB of 
each synthesized dataset across the N nodes. In these ex- 
periments, the cache size is held at 12,500 fingerprints. 
While this may seem small, it is similar to a cache of 
400,000 fingerprints on a single, large node. Also, a
cache must handle multiple backup streams, while our 
experiments use one dataset at a time. 

Figure 5 shows the skew of the uncompressed (logical)
data, maximum normalized total number of fingerprint
index lookups, maximum normalized fingerprint index
lookup rate, and ED when routing super-chunks of
various sizes for (a) Workstations and (b) Exchange.
Note that we report skew of the logical data here instead
of skew of the post-deduplication data reported elsewhere,
because fingerprint lookups happen on logical data. The
fingerprint index lookup numbers are normalized relative
to the rate seen when routing individual 8-KB chunks.
Because the lookup rate improvement achieved by using
larger super-chunk sizes generally comes with a cost
of lower deduplication, we also plot normalized ED to
aid the selection of an appropriate super-chunk size. It
should be noted that we found smaller differences in
lookup rate and total number of lookups with smaller
clusters.


Figure 4: Normalized ED versus number of nodes with various features. No bin migration is performed. The HYDRA-
stor points represent 64-KB chunks routed without super-chunks, with virtually no data skew but significantly worse
deduplication in most cases. Workstations is representative of many other datasets, while Exchange is anomalous.

Figure 5: Skew of data written to nodes (pre-deduplication), maximum number of fingerprint index lookups and lookup
rate, and ED versus super-chunk size for a 32-node cluster. Fingerprint index lookup values are normalized relative
to those metrics when routing individual 8-KB chunks. As the super-chunk size increases, the maximum number of
on-disk index lookups decreases for Workstations (improving throughput), while effective deduplication decreases.
Workstations is representative of many other datasets, while Exchange is anomalous.


For Workstations, we see that the total number of 
fingerprint index lookups and rate generally shrink as 
we use larger super-chunk sizes. Routing 4-MB super-
chunks results in ~65% of the maximum total index
lookups compared to routing chunks. Though data skew, 
maximum lookup rate, and maximum number of lookups
tend to follow the same trends, the values for maximum
number of lookups and maximum lookup rate may come 
from different nodes. 


The index lookups (both total and rate) for Exchange
around 1 MB highlight a case where our technique may
perform poorly due to a frequently repeating pattern in
the data set that causes a large fraction of the hash(64)
values to map to the same bin. With smaller super-chunk
sizes, less data are carried with each super-chunk, so
skew can be reasonably balanced via migration, and for
larger super-chunks, the problematic hash(64) value is
no longer selected. For this dataset, a super-chunk size of
1 MB results in higher skew that lowers ED, and it has a
high total number of lookups and worst-case cache miss
rate. This is a particularly difficult example for our system
as the same node had both the highest lookup rate
and skew, which roughly multiply together to equal total
lookups.


Figure 6: Normalized ED as a function of capacity
threshold on 32 nodes, for hash(64) and stateful, and
peak fraction of data movement (DM) for hash(64).
Note that lower points are better for data movement,
while higher is better for ED.

Although any particular super-chunk size can poten- 
tially result in skew if patterns in the data result in one bin 
being selected too often, the problem is rare in practice. 
Thus, despite this one poor example, we decided that 
1-MB super-chunks provide both reasonable throughput 
and deduplication and use that as the default super-chunk 
size in our other experiments. 

The scalability of our cluster design could more thor- 
oughly be analyzed with a comparison of the number of 
fingerprint index lookups for various cluster sizes relative 
to the single node case. Intuitively, a single-node system 
might have similar lookup characteristics to nodes in a 
cluster when routing very large super-chunks and with- 
out data skew. 


5.4 Space Usage Thresholds 


Limitations on storage use arise in two contexts. For 
stateless routing, we periodically migrate bins away from 
nodes storing more than the average, if they exceed a 
fixed threshold relative to the mean. In the simulations, 
bin migration takes place after multiple 1-TB epochs 
have been processed, totaling ~ 20% of a given dataset. 
This means that we attempt migrations approximately 5 
times per dataset regardless of size, plus once more at 
the end, if needed. For stateful routing, we refrain from 
placing new data on a node that is already storing more 
than that threshold above the average. 
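
Both uses of the threshold come down to comparing each node's physical usage with the cluster mean. The sketch below is illustrative only: the function names and the sample sizes are assumptions, and it shows the comparison itself rather than how bins are chosen for migration.

```python
def relative_usage(physical_per_node):
    """Each node's physical usage divided by the cluster average."""
    mean = sum(physical_per_node) / len(physical_per_node)
    return [p / mean for p in physical_per_node]


def nodes_needing_migration(physical_per_node, threshold=1.05):
    """Stateless routing: nodes above the threshold are candidates to shed bins."""
    usage = relative_usage(physical_per_node)
    return [i for i, u in enumerate(usage) if u > threshold]


def eligible_nodes(physical_per_node, threshold=1.05):
    """Stateful routing: nodes above the threshold receive no new data."""
    usage = relative_usage(physical_per_node)
    return [i for i, u in enumerate(usage) if u <= threshold]


physical = [100, 100, 100, 120]           # illustrative sizes, mean is 105
print(nodes_needing_migration(physical))  # [3]
print(eligible_nodes(physical))           # [0, 1, 2]
```

With the default of 1.05, a node is flagged once it holds more than 5% above the average, which is the setting used in most experiments here.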

Figure 7: Effective Deduplication as a function of the
amount of data processed, with and without bin migra-
tion at a 5% threshold, for the Workstations dataset
on 2 and 64 nodes. Migration points are marked along
the top, every 1 TB, with deduplication computed every
0.1 TB. The deduplication for a single node is depicted
as the top curve.


Figure 6 demonstrates the impact of the capacity
threshold on ED and peak data movement, using the
Workstations and Exchange datasets on 32 nodes.
The top four curves show ED: for Workstations, deduplication
effectiveness improves with increasingly tight
capacity bounds, although the benefit below 5% is mini-
mal, while for Exchange, the existence of a single over-
sized bin when using 1-MB super-chunks ensures a large
skew regardless of threshold in the case of hash(64).
The bottom two curves provide an indication of the 
impact of bin migration on data movement, as the thresh- 
old changes. We compute the fraction of data moved 
from a node at the end of an epoch, relative to the amount 
of physical data stored on the node at the time of the mi- 
gration, and report the maximum across all nodes and 
epochs. Exchange moves 15-20% of the incoming data 
(which is on the order of 1/32 of 1 TB) without improving
ED, while we would migrate at most a few percent
of one node’s data for Workstations. Note that across 
the entire dataset, migration accounts for at most 7000 
of the data, and on the 2-node commercial systems cur-
rently deployed, migrations have never occurred. Because at
32 nodes we do see small amounts of migration even for 
the Workstations dataset, and increasing the thresh- 
old from 1.01 to 1.05 reduces the total data migrated by 
nearly a factor of 2 without much of an impact on ED, we 
use 1.05 as the default threshold in other experiments. 
Figure 7 shows the impact of bin migration over time 
on the Workstations dataset. The curves for 2 nodes 
are identical, as no migration was performed. The curves 
for 64 nodes are significantly different, with the curve 
without migration having much worse ED. However, 
even with migration, the ED drops between migration 
points due to increasing skew. Note that this graph does 
not normalize deduplication relative to a single node, in 
order to highlight the effect of starting with entirely new
data, then increasing deduplication over time.


Sampling Rate |    Workstations    |      Exchange      | Memory
(1-in-N)      |  ED    Lookups (B) |  ED    Lookups (B) | (GB)
 1            | 5.28     20.57     | 5.74     21.95     | 96
 2            | 5.27     10.83     | 5.72     11.62     | 48
 4            | 5.27      6.03     | 5.76      6.46     | 24
 8            | 5.31      3.61     | 5.63      3.88     | 12
16            | 5.27      2.41     | 5.46      2.59     |  6
32            | 5.19      1.81     | 5.13      1.95     |  3

Table 3: The ED, Bloom filter lookups in billions, and
Bloom filter memory requirements in a 32-node system,
for two of the datasets. They vary as a function of the
sampling rate: which chunks are checked for existence
on each node. The memory requirement is independent
of the dataset.


5.5 Parameters for Stateful Routing 


In addition to capacity limitations, stateful routing is pa- 
rameterized by vote sampling and vote threshold as ex- 
plained in Section 4.3. Sampling has a great impact on 
the number of fingerprint lookups, while surprisingly, the 
system is not very sensitive to a threshold requiring a 
node to be a particularly good match to be selected. 

We evaluated sampling across a variety of datasets and 
cluster sizes, varying the selectivity of the anchors used 
to vote from 1 down to 1/32. Table 3 reports the effect
of sampling on ED and Bloom filter lookups for two of 
the datasets. (Slight rises in ED with less frequent sam- 
pling result from slightly lower skew due to not match- 
ing a node quite as often.) The last column of the table 
shows the size of a Bloom filter on a master node for a 
1% false positive rate and up to 20 TB of unique 8-KB 
chunks on each node; it demonstrates how the aggregate 
memory requirement on the master would decrease as the 
sampling rate decreases. The required size to track each 
node is multiplied by the number of nodes, 32 in this 
experiment. Each node would also have its own local 
Bloom filter, which would be unaffected by the sampling 
rate used for stateful routing. If lookups are forwarded 
to each node, sampling would be used to limit the num- 
ber of lookups, but the per-node Bloom filters used for 
deduplication would be used for routing, and no extra 
memory would be required. 
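
The memory column of Table 3 can be approximated from the stated parameters. The sketch below assumes a standard, optimally configured Bloom filter (about -ln(p)/(ln 2)^2 bits per element) and binary units (20 TiB per node, 8-KiB chunks); whether the reported figures were derived exactly this way is not stated, and the function names are hypothetical.

```python
import math


def bloom_bits_per_element(false_positive_rate: float) -> float:
    """Bits per element for an optimally sized Bloom filter."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def master_bloom_gib(unique_bytes_per_node, chunk_bytes, n_nodes,
                     false_positive_rate, sampling_rate):
    """Aggregate Bloom filter memory on a master tracking every node, in GiB."""
    elements = unique_bytes_per_node / chunk_bytes * sampling_rate
    bits = elements * bloom_bits_per_element(false_positive_rate)
    return n_nodes * bits / 8 / 2**30


for n in (1, 2, 4, 8, 16, 32):
    gib = master_bloom_gib(20 * 2**40, 8 * 2**10, 32, 0.01, 1 / n)
    print(n, round(gib))
```

Under these assumptions the result is close to 96 GB with no sampling and close to 12 GB at a sampling rate of 1/8, in line with the table.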

We found that the ED is fairly constant from looking up
all chunks (a sampling rate of 1) down to a rate of 1/8 and
often similar when sampling 1/16; it degrades significantly,
as expected, when less than that. Thus we use a default
of 1/8 for stateful routing elsewhere in this paper.

We also examined the vote benefit threshold. While 
we use a default of 1.5, the system is not very sensitive
to values from 0.75 to 2. The key is to have a high enough
threshold that a single chunk will not “attract” more and 
more dissimilar chunks due to one match. 


6 Cluster Deduplication Product 


EMC now makes a product based on this technol- 
ogy [13]. The cluster configuration currently consists of 
two nodes and uses the hash (64) routing technique with 
bin migration. Each node has the following hardware 
configuration: 4 processor sockets with 4 cores per socket,
each core running at 2.93 GHz; 64 GB of memory;
four 10-Gb Ethernet interfaces (one for external traffic 
and one for inter-node traffic, both in a fail-over pair); 
and 140 TB of storage, consisting of 12 shelves of 1-TB 
drives. Each shelf has 16 drives in a 12+2 RAID-6 con- 
figuration with 2 spare drives. 

The total physical capacity of the two-node system is 
280 TB. Under typical backup usage, total compression 
is expected to be 20X, which leads to a logical capac- 
ity of 5.6 PB of storage. Write performance with mul- 
tiple streams is over 3 GB/s. Note that this performance 
was achieved with processing on the backup server as de- 
scribed in Section 2, which communicates with storage 
nodes to filter duplicate chunks before network transfer. 
Because of the filtering step, logical throughput (file size 
divided by transfer time) can even exceed LAN speed. 

We measured the steady-state write and read perfor- 
mance with 1-4 nodes and found close to linear improve- 
ment as the number of nodes increases. While simula- 
tions suggest our architecture will scale to a larger num- 
ber of nodes, we have not yet tuned our product for a 
larger system or run performance tests. 

In over six months of customer usage, bin migration 
has never run, which indicates stateless routing typically 
maintains balance across two nodes. 


7 Related Work 


Chunk-based deduplication is the most widely used 
deduplication method for secondary storage. Such a 
system breaks a data file or stream into contiguous 
chunks and eliminates duplicate copies by recording ref- 
erences to previous, identical chunks. Numerous stud- 
ies have investigated content-addressable storage us- 
ing whole files [1], fixed-size blocks [27, 28], content- 
defined chunks [17, 24, 36], and combinations or com- 
parisons of these approaches [19, 23, 26, 32]; generally, 
these have found that using content-defined chunks im- 
proves deduplication rates when small file modifications 
are stored. Once the data are divided into chunks, each chunk is
represented by a secure fingerprint (e.g., SHA-1) used
for deduplication. 

A technique to decrease the in-memory index require- 
ments is presented in Sparse Indexing [20], which uses a 
sampling technique to reduce the size of the fingerprint
index. The backup set is broken into relatively large re-
gions in a content-defined manner similar to our super-
chunks, each containing thousands of chunks. Regions 
are then deduplicated against a few of the most similar 
regions that have been previously stored using a sparse, 
in-memory index with only a small loss of deduplication. 

While Sparse Indexing is used in a single system to re- 
duce its memory footprint, the notion of sampling within 
a region of chunks to identify other chunks against which 
new data may be deduplicated is similar to our sam- 
pling approach in stateful routing. However, we use 
those matches to direct to a specific node, while they use 
matches to load a cache for deduplication. 

Several other deduplication clusters have been pre- 
sented in the literature. Bhagwat et al. [2] describe a 
distributed deduplication system based on “Extreme Bin- 
ning”: data are forwarded and stored on a file basis, and 
the representative chunk ID (the minimum of all chunk 
fingerprints of a file) is used to determine the destination. 
An incoming file is only deduplicated against a file with 
a matching representative chunk ID rather than against 
all data in the system. Note that Extreme Binning is in- 
tended for operations on individual files, not aggregates 
of all files being backed up together. In the latter case, 
this approach limits deduplication when inter-file local- 
ity is poor, suffers from increased cache misses and data 
skew, and requires multiple passes over the data when 
these aggregates are too big to fit in memory. 
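
The file-level routing rule described above can be sketched as follows. This is an illustration under stated assumptions, not the Extreme Binning implementation: fixed-size chunking and the modulo mapping from representative chunk ID to node are our own choices, and all function names are hypothetical.

```python
import hashlib


def chunk_fingerprints(data: bytes, chunk_size: int = 4):
    """SHA-1 fingerprints of fixed-size chunks (chunking scheme is an assumption)."""
    return [hashlib.sha1(data[i:i + chunk_size]).digest()
            for i in range(0, len(data), chunk_size)]


def representative_chunk_id(data: bytes, chunk_size: int = 4) -> bytes:
    """The minimum of all chunk fingerprints of a (non-empty) file."""
    return min(chunk_fingerprints(data, chunk_size))


def destination_node(data: bytes, num_nodes: int, chunk_size: int = 4) -> int:
    """Map the representative chunk ID to a node (modulo mapping is an assumption)."""
    rep = representative_chunk_id(data, chunk_size)
    return int.from_bytes(rep, "big") % num_nodes
```

Two files that contain the same set of chunks share a representative chunk ID and are therefore sent to the same destination, regardless of the order in which the chunks appear.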

DEBAR [34] also deduplicates individual files written 
to their cluster. Unlike our system, DEBAR deduplicates 
files partially as they are written to disk and completes 
deduplication during post-processing by sharing finger- 
prints between nodes. 

HYDRAstor [8] is a cluster deduplication storage 
system that creates chunks from a backup stream and 
routes chunks to storage nodes, and HydraFS [33] is 
a file system built on top of the underlying HYDRA- 
stor architecture. Throughput of hundreds of MB/s is 
achieved on 4-12 storage nodes while using 64 KB-sized 
chunks. Individual chunks are routed by evenly parti- 
tioning fingerprint space across storage nodes, which is 
similar to the routing techniques used by Avamar [11] 
and PureDisk [7]. In comparison, our system uses larger 
super-chunks for routing to maximize cache locality and 
throughput but also uses smaller chunks for deduplica- 
tion to achieve higher deduplication. 
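
Routing individual chunks by evenly partitioning the fingerprint space can be sketched as below. This is a minimal illustration assuming 160-bit (SHA-1) fingerprints and equal-width contiguous ranges per node; it is not taken from HYDRAstor, Avamar, or PureDisk, and `node_for_fingerprint` is a hypothetical name.

```python
FP_BITS = 160  # assumed fingerprint width (SHA-1)


def node_for_fingerprint(fp: bytes, num_nodes: int) -> int:
    """Return the node that owns the equal-width range containing fp."""
    value = int.from_bytes(fp, "big")
    # Scale the fingerprint from [0, 2**FP_BITS) down to [0, num_nodes).
    return (value * num_nodes) >> FP_BITS
```

Because consecutive chunks of a stream have unrelated fingerprints, this rule scatters neighbouring chunks across nodes, which is the locality cost that routing at super-chunk granularity is intended to avoid.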

Choosing the right chunking granularity presents a 
tradeoff between deduplication and system capacity and 
throughput even in a single-node system [35].  Bi- 
modal chunking [18] is based on the observation that 
using large chunks reduces metadata overhead and im- 
proves throughput, but large chunks fail to recover some 
deduplication opportunities when they straddle the point 
where new data are added to the stream. Bimodal chunking
tries to identify such points and uses a smaller chunk
size around them for better deduplication.


8 Conclusion and Future Work 


This paper presents super-chunk routing as an important 
technique for building deduplication clusters to achieve 
scalable throughput and capacity while maximizing ef- 
fective deduplication. We have investigated properties of 
both stateless and stateful versions of super-chunk rout- 
ing. We also describe a two-node deduplication storage 
product that implements the stateless method to achieve 
3 GB/sec deduplication throughput with the capacity to 
store approximately 5.6 PB of backup data. 

Our study has three conclusions. First, we have found 
that using super-chunks, a multiple of fine-grained dedu- 
plication chunks, for data routing is superior to using 
individual chunks to achieve scalable throughput while 
maximizing deduplication. We have demonstrated that a 
1-MB super-chunk size is a good tradeoff between index 
lookups, which directly impact deduplication through- 
put, and effective cluster-wide deduplication. 

Second, the stateless routing method (hash(64)) with
bin migration is a simple and yet efficient way to build 
a deduplication cluster. Our simulation results on real- 
world datasets show that this method can achieve good 
balance and scalable throughput (good caching locality) 
while achieving at least 80% of the single-node effective 
deduplication, and bin migration appears to be critical to 
the success of the stateless approach in larger clusters. 

Third, our study shows that effective deduplication of 
the stateless routed cluster for certain datasets (most no- 
tably Exchange) may drop quickly as the number of 
nodes increases beyond 4. To solve this problem, we 
have proposed a stateful data routing approach. Simula- 
tions show this approach can achieve 80% or better nor- 
malized ED when using up to 64 nodes in a cluster, even 
for “pathological” cases. 

Several issues remain open. First, we would like to 
further our understanding of the conditions that cause se- 
vere data skew with the stateless approach. To date, no 
bin migration has occurred in the production system de- 
scribed in this paper; this is not surprising considering 
that ED for hash(64) on two nodes is virtually identical
for each of our datasets, with or without bin migration. 
The same is true for most, but not all, of the datasets as 
the cluster size increases moderately. Second, we plan 
to examine the scalability of the system across a broad 
range of cluster sizes and the impact of parameters such 
as feature selection and super-chunk size. Third, we want
to explore the use of bin migration to support reconfiguration
such as node additions. Finally, we plan to build a
prototype cluster with stateful routing so that more thorough
experiments can be conducted in lab and in customer
environments.


Abstract


Shared storage underlies most enterprise VM deployments
because it is an established technology that administrators
are familiar with and because it does a good job of protecting
data. However, shared storage is also very expensive
to scale. This paper describes Capo¹, a transparent
and persistent block request proxy for virtual machine 
disk images. Capo reduces the load on shared storage by 
using local disks as persistent caches, using multicast- 
based preloading to broadcast read results across a clus- 
ter, and by imposing differential durability — dividing a 
VM’s file system into regions of varying writeback fre- 
quency. We motivate the system’s design through the 
analysis of a week-long trace of 55 production virtual 
desktops and then describe and evaluate our implemen- 
tation. Capo is particularly well suited for virtual desk- 
top deployments, in which large numbers of VMs boot 
from a small number of “gold master” images and are 
refreshed on a periodic basis. 


1 Introduction 


The storage we trust is expensive. Fast and reliable data 
storage is something that organizations are prepared to 
pay a premium for, both in the capital costs of enterprise 
storage hardware and the operational costs of ensuring 
that important data is written to it. 

Interestingly, the deployment of virtualization has in- 
verted the historical imperative that systems be config- 
ured to “opt-in” to storing data on appropriate network 
shares instead of on less reliable locations such as lo- 
cal disks. While administrators used to have to work to 
configure applications to use enterprise storage, virtual 
environments simply store everything on it. 


¹The name of our system is borrowed from the phrase “Da Capo
al coda”, which is used in sheet music to indicate a brief return to the 
beginning of a piece, followed by a jump to the Coda, or conclusion. 
In sonatas, this “recapitulation” involves revisiting a similar, but some- 
times different version of the main theme of the arrangement. 


As such, these environments present the opposite 
problem: The requirement that virtual machine images 
be universally accessible, with high performance, to all 
physical hosts in a cluster has necessitated the deploy- 
ment of SAN hardware in even modest virtualization 
deployments. The improved density and utilization af- 
forded by virtualization allows systems to scale to large 
numbers of VMs; shared storage must scale proportion- 
ately to provide for them. This symptom is especially 
problematic for virtual desktops, where infrastructure is 
being deployed to host literally thousands of nearly iden- 
tical VM images. A number of commercial virtual desk- 
top systems now exist, and deployments suffer from a 
significant, if not dominant, cost for enterprise storage. 

This paper argues that shared, central storage is the
correct approach for scalable virtual environments. It 
is trustworthy, relatively easy to manage, and simple to 
reason about. However, we believe that for applications 
such as virtual desktops, which involve large numbers 
of image clones, the majority of request load is redun- 
dant and can be effectively serviced by local, commodity 
disks within individual servers. Furthermore, we believe 
that the levels of durability provided by enterprise stor- 
age in these environments are in excess of what is neces- 
sary for large portions of desktop OS disk images. 

The contributions of this paper are twofold: First, we 
validate our hypothesis through the analysis of a week- 
long trace of all storage traffic from a production deploy- 
ment of 55 Windows Vista desktops in an executive and 
administrative office of a large public organization. Our 
results examine the opportunities that exist for caching 
data both within and across virtual desktop images. They 
also examine the breakdown of request workload within 
desktop filesystems. 

Second, based on the analysis of this trace, we de- 
scribe the design and implementation of Capo, a dis- 
tributed persistent cache that aims to reduce aggregate 
load on shared storage in virtual desktop environments. 
Capo uses local server disks to provide persistent caching
of VM images, and includes mechanisms to share and
pre-load caches of gold master images across VMs and 
across hosts. Finally, Capo introduces a facility for dif- 
ferential durability, which allows administrators to selec- 
tively “opt out” of enterprise storage guarantees by relax- 
ing the durability properties of subsets of a desktop’s file 
system. 

Virtual desktop systems have already taken advantage 
of several approaches to scale storage to large numbers 
of desktop machines. We begin in Section 2 by providing 
some brief initial background on these systems. 


2 VDI Background 


Virtual desktops represent the latest round in a decades- 
long oscillation between thin- and thick-client computing 
models. So-called Virtual Desktop Infrastructure (VDI) 
systems have emerged as a means of serving desktop 
computers from central, virtualized hardware. VDIs are 
being touted as a new compromise in a history of largely 
unsuccessful attempts to migrate desktop users onto thin 
clients, and the approach does provide a number of bene- 
fits. Giving users private virtual machines preserves their 
ability to customize their environment and interact with 
the system as they would a normal desktop computer. 
From the administration perspective, consolidating VMs 
onto central compute resources has the potential to re- 
duce power consumption, allow location-transparent ac- 
cess, better protect private data, and ease software up- 
grades and maintenance. 

Commercial VDI systems appear to be experiencing 
a degree of success: Gartner predicts that forty percent 
of all worldwide desktops (49 million in total) will be
virtualized by 2013 [17]. Today, the two major vendors 
of VDI systems, Citrix and VMware, individually describe
numerous case studies of active virtual desktop
deployments of over 10,000 users. From a storage per- 
spective, VDI systems have faced immediate challenges 
around space overheads and the ability to deploy and up- 
grade desktops over time. As background, this section 
describes how these problems are typically solved in ex- 
isting architectures, as illustrated in Figure 1. 


2.1 Copy-on-Write and Linked Clones 


Operating system images are entire virtual disks, often 
tens of gigabytes each. A naive approach to support- 
ing hundreds or thousands of virtual machines results in 
two immediate storage scalability problems. First, VMs 
must have isolated disk images, but maintaining individ- 
ual copies of every single disk is impractical and con- 
sumes an enormous amount of space. Second, adding 
new users requires that images can be quickly duplicated 
without necessitating a complete copy.


Figure 1: Typical image management in VDI systems.


This observation is not new; it has been a recur- 
ring challenge in virtualization. Existing VDI systems 
make use of VM-specific file formats such as Microsoft’s 
VHD [14] and VMware’s VMDK [22]. Both allow a 
sparse overlay image to be “chained” to a read-only base 
image (or gold master). As shown in Figure 1, modifi- 
cations are written to private, per-VM overlays, and any 
data not in the overlay is read from the base image. In 
this manner, large numbers of virtual disks may share 
a single gold master. This approach consolidates com- 
mon initial image data, and new images may be quickly 
cloned from a single gold master. 
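
The chaining behaviour described above can be sketched in a few lines. This is a simplified model, not the VHD or VMDK format: sectors are dictionary entries, and the class names `BaseImage` and `Overlay` are hypothetical.

```python
class BaseImage:
    """Read-only gold master, modelled as a mapping of sector -> bytes."""

    def __init__(self, sectors):
        self._sectors = dict(sectors)

    def read(self, sector):
        # Sectors never written in the gold master read back as zero.
        return self._sectors.get(sector, b"\x00")


class Overlay:
    """Sparse private per-VM overlay chained to a shared base image."""

    def __init__(self, base):
        self.base = base
        self.private = {}  # only sectors this VM has modified

    def write(self, sector, data):
        # Modifications go to the overlay; the base is never touched.
        self.private[sector] = data

    def read(self, sector):
        # Data not present in the overlay is read from the base image.
        if sector in self.private:
            return self.private[sector]
        return self.base.read(sector)
```

Creating a clone amounts to constructing an empty `Overlay`, which is why new images can be produced without copying the gold master.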


2.2 Image Updates and Periodic Rollback 


Image chaining saves space and allows new images to 
be cloned from a gold master almost instantaneously. It 
is not a panacea though. Chained images immediately 
begin to diverge from the master version as VMs issue 
writes to them. One immediate problem with this diver- 
gence is the consumption of independent extra storage 
on a per-image basis. This divergence problem for stor- 
age consumption is typically addressed through the use 
of data deduplication [24, 6, 4]. 

For VDI, wasted storage is not the most pressing con- 
cern: block-level chaining means that patches and up- 
grades cannot be applied to the base image in a manner 
that merges and reconciles with the diverged clones. This 
means the ability to deploy new software or upgrades to a 
large number of VMs, which was initially provided from 
the single gold master, is immediately lost.

The leading VDI offerings all solve this problem in a 
very similar way: They disallow users from persisting 
long-term changes to the system image. When gold master
images are first created and clones are deployed, the
VDI system arranges images to isolate private user data 
(documents, settings, etc.) on separate storage from the 
system disk itself. As suggested initially in the Collective
project [3], this approach allows a new gold master
with updated software to be prepared and deployed to
VMs simply by replacing the gold master, creating new 
(empty) clones, and throwing away the old version of 
the system disk along with all changes. This approach 
effectively “freshens” the underlying system image of 
all users periodically and ensures that all users are us- 
ing a similar well-configured desktop. For the most part, 
it also means that users are unable to install additional, 
long-lived software within VDI images without support 
from administrators. 


3 Virtual Desktop Trace 


To better understand VDI workloads, we arranged to 
measure all block- and file-level activity from a production
VDI deployment for a one-week period during
the summer of 2010. The deployment being studied is
an office in a large public organization containing executive
and administrative support staff. The deployment
had been in production use for six months and includes
55 Windows Vista desktops; the organization is in the
process of rolling out another 300 desktops this fall.


3.1 Methodology 

We installed a Windows storage class driver into the base 
system image of the virtual desktop machines. The driver 
was written to record block read and write events to 
the virtual disks using the Microsoft Windows Software 
Trace Preprocessor (WPP). It recorded request size and 
virtual disk address. In 93% of cases we were also able 
to determine the file on which the access originated by 
following the OriginalFileObject pointer in the Windows 
I/O Request Packet (IRP) structure. To better contextu- 
alize this information, we also installed a driver at the 
filesystem level and recorded cache accesses, the appli- 
cation making each request, and the file flags for each file 
accessed. Our disk-level driver is written in 515 lines of 
C, while our file-level driver is 82 lines of C. 


Logs from these drivers were written to a network 
share and collected on the Thursday following a full 
week of logging. In total we collected 75GB of logs in 
a compressed binary format. We then checked for cor- 
ruption, missing logs, or missing events. Out of over 300 
million entries we found a single anomalous write to a 
clearly invalid block address, which we removed. We 
could find no explanation for the event. In the rest of 
this section we present our analysis of this data. Unless 
otherwise specified, we will refer to block level accesses 
to a virtual disk and measure aggregate workloads in I/O 
operations per second (IOPS). 


3.2 Our Virtual Desktop Environment


The environment we are studying is structured very much 
like the one described in Section 2. At the time our measurements
were gathered it hosted 55 Microsoft Windows
Vista virtual desktops with VMware View, of
which roughly 27 are in dedicated day-to-day use as the 
primary desktop. This small size is the primary limi- 
tation of our study, but we expect to measure consider- 
ably more as the installation grows. Furthermore, even 
at the current size it is possible to see considerable self- 
similarity among machines, as we will discuss. 

End users work from Dell FX100 Zero thin clients, 
while VMs are served from HP BL490c G6 Blades run- 
ning ESX Server. These servers connect to a Network 
Appliance 3170s over fiber channel, for booting from 
the SAN, and 10GigE, for VM disk images. System 
images are hosted via NFS on a 14 drive RAID group 
with 2 parity disks. The operating systems and applica- 
tions are optimized for the virtual environment [20] and 
are pre-loaded with Firefox, Microsoft Office Enterprise, 
and Sophos Anti-Virus among other software. At the end 
of every Wednesday, a new system image is published to 
all users exactly as discussed in Section 2.2. 


3.3 Analysis 


We begin by asking, What are the day-to-day charac- 
teristics of VDI storage workloads? Figure 2 shows the 
entire study in I/O operations-per-second for the 24-hour 
period of each of the 7 full days recorded. There is a 
distinct peak load period between 8:30 and 9:30 every 
morning, as employees arrive at work. Three peaks in 
this period are highlighted in the figure and presented 
for expanded analysis in Table 1. The right-most col- 
umn shows the applications responsible for the most disk 
I/O, excluding the system and services. Most days, Firefox
and the virus scanner are very active in this period;
we also see Thunderbird, Pidgin, and Microsoft Outlook
frequently. We were surprised to see the Search Indexer 
active as well, because we were told its background scan- 
ning task had been disabled to reduce I/O consumption. 
Our best guess is that it was invoked manually. 

We measured the write-to-read percentages for both
IOPS and throughput, which is useful in characterizing
the workload. Our workload is write-heavy in IOPS and
read-heavy in throughput, both by approximately two-to-one.
We then measured the percentage of VMs which
contributed at least 5% of the peak workload, to determine
if peaks were caused by multiple VMs or by a few
outliers. In most cases, it is the former; however, the peak
in slice 4 was caused primarily by 4 VMs. The column
titled “Dup. reads” illustrates the potential for caching.
We present two numbers. The left-most is the percentage
of reads that have been previously seen by that VM
over the course of the trace. With a large enough cache,
we could potentially absorb all these reads. The right-hand
number presents the same measure, but imagines
that caching could be shared across all VMs in the cluster.
Slice 4 stands out for having an unusually low duplicate
read rate for VMs, but a very high rate across the
cluster as a whole. We investigated and found that two
very active VMs had duplicate read rates of 26% and
30%. By including the least beneficial 38%, 15% and
4% of VMs, we could reach duplicate read rates of 40%,
60% and 90% respectively. From this we conclude that
caching, possibly even shared caching, can yield significant
improvements, but that some benefits
may require careful selection of the VMs in question.


Mon. 2:00pm-2:30pm
Tue. 2:40pm-3:00pm
Wed. 4:00pm-4:15pm

Slice  Writes (IOPS / throughput)  Dup. reads (VM / cluster)  Top applications
1      50% / 22%                   81% / 01%                  Search Indexer, Firefox, Sophos
2      52% / 22%                   88% / 97%                  Firefox, Search Indexer, Sophos
3      64% / 43%                   78% / 99%                  Defrag, Firefox, Search Indexer
4      62% / 41%                   59% / 99%                  Firefox, Pidgin, Sophos
5      69% / 52%                   771% / 97%                 Firefox, Defrag, Pidgin
6      60% / 37%                   99% / >99%                 Firefox, Pidgin, Sophos

Table 1: Points of interest in Figure 2.
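
The two duplicate-read measures discussed above can be computed from a read trace roughly as follows. This is a sketch, not our analysis code: it assumes the trace is a sequence of (VM identifier, block address) pairs in time order and that equal block addresses on different VMs refer to the same data, which holds only for blocks still shared with the gold master. The function name is hypothetical.

```python
def duplicate_read_rates(reads):
    """Return (per_vm_rate, cluster_rate) as fractions of all reads.

    reads: iterable of (vm_id, block_address) pairs in trace order.
    """
    seen_by_vm = {}          # vm_id -> set of blocks that VM has already read
    seen_by_cluster = set()  # blocks read by any VM so far
    total = dup_vm = dup_cluster = 0
    for vm, block in reads:
        total += 1
        vm_seen = seen_by_vm.setdefault(vm, set())
        if block in vm_seen:
            dup_vm += 1       # a private, per-VM cache could have served this
        if block in seen_by_cluster:
            dup_cluster += 1  # a cache shared across VMs could have served this
        vm_seen.add(block)
        seen_by_cluster.add(block)
    if total == 0:
        return 0.0, 0.0
    return dup_vm / total, dup_cluster / total
```

The cluster-wide rate is never lower than the per-VM rate, since any block a VM has already read has also been read by the cluster.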


Lunch, dinner, and late nights are periods of relative 
inactivity, as are the weekends. Late afternoon peaks 
are sporadic, but reach loads nearly as high as the morn- 
ing. One such peak, marked 4 in Figure 2 and Table 1, 
was caused by a relatively small number of machines en- 
gaged in heavy browsing activity. This is not the norm, 
as all other peaks occur when more than a quarter of 
the VMs are significantly active. This is clear in Fig- 
ure 3, which shows a CDF of VMs by their contribution 
to the total workload for each peak. These peak load pe- 
riods are particularly important, because they define the 
hardware necessary to service the workload without disruption.
We conclude that VDI workloads are defined
by their peaks, and those peaks usually occur at times
of common activity among many VMs. In Sections 4.1
and 4.2 we take advantage of this fact to improve performance
in VDI environments.


Figure 3: Contributors to each of the major workload
peaks. 


Next, we ask, How can we characterize the I/O re- 
quests? In Figure 4 we show the entire workload by 
both request count and size. We differentiate reads from 
writes, and also isolate each request by its target in the 
file system namespace. The workload is 65% writes, 
which account for 35% of the throughput, versus 35% 
of reads accounting for 65% of the throughput. Metadata
operations account for a large portion of the requests;
unfortunately we cannot determine how these modifications
relate to the namespace. Directories typically managed
by the operating system, such as \Windows and
\Program files, are also frequently accessed. There
are fewer accesses to user directories and temporary files;
most of the latter are to \Temporary Internet
Files, as opposed to \Windows\Temp. These findings
contrast with those of Vogels, whose study showed that
93% of file-level modification occurred in \User directories [23].
We conclude that, while a wide range of
the namespace is accessed, it is not accessed uniformly,
and access to data directly managed by users is rare.
We will revisit this observation in Section 4.3.4.


Figure 4: Size and amount of block level writes by file
system path.

Figure 5: Percentage of bytes that need to be written to
the server if writes are held back for different time periods.
This is lower than the original volume of writes due
to the elimination of rewrites.


Since our workload is write-heavy, we next ask, how 
are these writes organized in time? Figure 5 shows the 
percentage of disk writes that overwrite recently writ- 
ten data, for time intervals ranging from 10 seconds to 
a whole day. We have included results from each of the 
seven days to underscore how consistent the results are. 
In a short time span, just 10 seconds, 8% of bytes that 
are written are written again. This rate increases to 20%- 
30% in 10 minute periods and ranges between 50%-55% 
for twenty-four hour periods. From this we conclude
that considerable system-wide effort is spent on data
with a high modification rate. We show how this can
be used to our advantage in Section 4.3.
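
The effect of holding writes back can be estimated from a write trace with a sketch like the one below. It is an approximation under our own assumptions, not the procedure used to produce Figure 5: time is divided into fixed windows, and only the last write to each sector within a window is counted as reaching the server. The function name and trace layout are hypothetical.

```python
def bytes_after_holdback(writes, interval):
    """Fraction of written bytes still sent to the server after coalescing.

    writes:   list of (timestamp_seconds, sector, nbytes) in time order.
    interval: hold-back window length in seconds (must be positive).
    """
    total = sum(nbytes for _, _, nbytes in writes)
    if total == 0:
        return 0.0
    pending = {}  # (window index, sector) -> size of the latest write
    for ts, sector, nbytes in writes:
        # A later write to the same sector in the same window replaces the
        # earlier one, so the earlier bytes never leave the host.
        pending[(int(ts // interval), sector)] = nbytes
    return sum(pending.values()) / total
```

With a very short window nothing is coalesced and the fraction is 1.0; longer windows absorb more rewrites and the fraction falls.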

Since VMs typically use disk images chained from a 
gold master, we are interested in the rate at which the 
overlay image diverges from the original image. We 
therefore ask, At what frequency do we observe the first 
write to a sector? Figure 6 plots this data for the av- 
erage VM, as well as the most and least divergent VM, 
over the entire study. Within 24 hours, most VMs hit a
near plateau in their divergence, around 1GB. Over time
this does increase, but slowly. A smaller set of VMs
do diverge more quickly and significantly, but they are
far from the 95% confidence interval. We conclude that
there is significant shared data between VMs, even after
several days of divergence.


Figure 6: Bytes of disk diverging from the gold master.

Figure 7: Total divergence versus time for each namespace
category.
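
Divergence from the gold master can be tracked by counting the first write to each sector, as in the following sketch. It assumes a trace of (timestamp, sector) write events in time order and 512-byte sectors; both the sector size and the function name are assumptions made for illustration.

```python
SECTOR_BYTES = 512  # assumed sector size


def divergence_over_time(writes):
    """Return [(timestamp, cumulative_diverged_bytes), ...].

    writes: iterable of (timestamp, sector) pairs in time order.
    One point is emitted for each first write to a sector; rewrites of an
    already-diverged sector do not increase divergence.
    """
    touched = set()
    curve = []
    for ts, sector in writes:
        if sector not in touched:
            touched.add(sector)
            curve.append((ts, len(touched) * SECTOR_BYTES))
    return curve
```

A curve that flattens quickly, as in Figure 6, indicates that most later writes land on sectors that have already diverged.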

Naturally, we do not expect divergent writes to occur 
uniformly, so we pose a question: Where in the names- 
pace do divergent writes occur, and does this change 
over time? Figure 7 plots the cumulative divergence for 
each VM in the cluster, and divides that total among var- 
ious components of the namespace. One can observe 
that the pagefile diverges immediately, then remains a 
constant size over time, as does the system metadata. 
Both these files are bounded in size. Meanwhile writes 
to \Windows and areas of the disk we cannot associate 
with any file continue to grow over the full week of the 
study. We conclude that while writes occur everywhere
in the namespace, they exhibit significant trends when 
categorized according to the destination. 


3.4 Summary 


While there is more to say about this workload and 
those of VDI environments in general, the observations 
in this section are valuable. Summarizing our observations
from the trace data:


• VDI workloads are defined by their peaks, and those
  peaks usually occur at times of common activity
  among many VMs

• While a wide range of the namespace is accessed,
  it is not accessed uniformly, and access to data
  directly managed by users is rare

• Considerable system-wide effort is spent on data
  with a high modification rate

• There is significant shared data between VMs, even
  after several days of divergence

• While writes occur everywhere in the namespace,
  they exhibit significant trends when categorized
  according to the destination


These observations taken together suggest that addi- 
tional caching, combined with an awareness of names- 
pace organization might resolve the performance chal- 
lenges that we have observed. The following section 
builds on the observations and analysis presented here, 
and describes the architecture of Capo. 


4 Architecture 


The trace analysis in Section 3 suggests that caching be- 
low the individual VMs may be effective in resolving the 
demand peaks that we observed. In this section we de- 
scribe Capo, including its three major components: 


1. A single-host cache which eliminates redundant 
reads and writes from virtual desktops hosted on the 
same server. 


2. A multi-host cache preloader which eliminates re- 
dundant reads from virtual desktops hosted on dif- 
ferent servers. 


3. A component that supports differential durability, 
which modifies cache coherency based on the loca- 
tion in the namespace of the affected file. 


Figure 8 shows the overall architecture of Capo. Capo 
exists as a layer within the virtual machine monitor 
(VMM) which supports the individual desktop VMs. 
The figure depicts each host including a Local Persistent 
Cache which is stored on the local disk of the host ma- 
chine and is described in Section 4.1. Spanning all of the 
hosts is the Transparent Multi-host Prefetch component 
which optimistically preloads data accessed by one host 
into the local caches of the other hosts. It is described 
in Section 4.2. The Durability Map component supports 
the wide variation in the durability requirements of the 
various components of the file system. It is described in 
Section 4.3.


Figure 8: The Major Architectural Components of Capo.


4.1 Local Persistent Cache 


All VDI deployments rely on central enterprise storage 
that provides high availability, durability, and reliability. 
The servers that host virtual desktops are also configured 
with local disk storage which consists of cheap COTS 
disk drives with comparatively lower reliability but higher
aggregate I/O bandwidth. 

The trace-based analysis of our local VDI deployment 
suggests that a cache shared between multiple virtual 
desktops might be very effective. As shown in Figure 3, 
there is significant overlap between the top applications 
executed on different virtual machines. Table 1 refines
this and indicates that aggressive caching can yield very 
high read hit rates. Also, as shown in Figure 5, a signif- 
icant fraction of data is overwritten very quickly. There- 
fore, as depicted in Figure 8, each server machine that 
hosts virtual desktops uses its local disk as a persistent 
cache. A key goal for Capo is to provide an appropriate 
level of durability for all data while taking advantage of 
the higher aggregate bandwidth available to local disk. 
The level of durability achieved depends on the caching 
policy in place. 


4.1.1 Caching Policies 


The cache supports two consistency policies: write- 
through and write-back. These policies are enforced at 
disk image granularity. Each of these policies represents 
a different tradeoff between virtual disk consistency and 
overall system performance. 

The write-through policy provides the highest level of 
consistency guarantees a machine would expect from a 
block device. In this policy the cache replicates writes to 
both the centralized storage and the local cache. Write 
requests are not acknowledged until they hit both disks. 
This policy relieves the centralized storage from serving 
reads to blocks that have been previously read from or 
written to. The drawback of this policy is that write requests
must be sent across the network, consuming network
bandwidth and increasing both the load on the centralized
storage and the client’s perceived write request
latency. 

The write-back policy delays pushing updates to disk 
blocks by caching writes locally in the write cache. Up- 
dates are pushed to the central storage in a crash consis- 
tent manner at a per-virtual-disk configurable frequency. 
The choice of write-back frequency trades off system 
performance and durability of disk contents in case of a 
failure. A high update frequency minimizes the amount 
of data loss in case of the failure of the local disk, while 
a lower frequency enhances overall system performance 
by coalescing writes in the local cache. 


4.1.2 Design and Implementation 


The local persistent cache is implemented as an exten- 
sion to the publicly available XenServer 5.6 release. It 
runs in Xen’s “domain 0” VM and interposes on the 
block request path below virtual machines. Cached data 
is stored as sparse image files in a Linux file system. 
Each virtual disk’s cache consists of either two or three 
components, shown in Figure 9: a read store, write store, 
and possibly a snapshot store. Each of these components 
is represented using a data file and a bitmap in the per- 
sistent cache. The bitmap’s purpose is to identify which 
sectors of the corresponding data file are valid. Writing a 
sector to a cache component involves writing the sector’s 
data to the data file and setting the sector’s corresponding 
bits in the on-disk bitmap. 

Write requests are satisfied by writing their data sec- 
tors to the cache’s write store. When the cache is set to 
write-through, the sectors are also written concurrently 
to the centralized storage. Read requests are satisfied by 
first checking the write store, then the snapshot if it ex- 
ists, and finally the read store. At each layer, if a sector is 
valid as specified in the bitmap, the data can be returned 
immediately from that layer. If none of the stores contain 
valid data, the sectors are read from the centralized stor- 
age, written to the read store, and returned to the client 
VM. 
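The request path described above can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the class names (`Store`, `VirtualDiskCache`) and the backend callbacks are hypothetical, a Python set stands in for the on-disk bitmap, and the write-through write is issued sequentially rather than concurrently.

```python
from typing import Callable, Dict, Optional, Set


class Store:
    """One cache component: a data file plus a validity bitmap (both modeled in memory)."""

    def __init__(self) -> None:
        self.data: Dict[int, bytes] = {}  # stands in for the sparse data file
        self.valid: Set[int] = set()      # stands in for the on-disk bitmap

    def write(self, sector: int, payload: bytes) -> None:
        # Write the sector's data, then mark the sector valid.
        self.data[sector] = payload
        self.valid.add(sector)

    def read(self, sector: int) -> Optional[bytes]:
        # Only sectors flagged valid in the bitmap may be served.
        return self.data[sector] if sector in self.valid else None


class VirtualDiskCache:
    """Hypothetical model of one virtual disk's cache (names are assumptions)."""

    def __init__(self,
                 backend_read: Callable[[int], bytes],
                 backend_write: Callable[[int, bytes], None],
                 write_through: bool) -> None:
        self.write_store = Store()
        self.snapshot_store: Optional[Store] = None  # exists only during a push
        self.read_store = Store()
        self.backend_read = backend_read
        self.backend_write = backend_write
        self.write_through = write_through

    def write(self, sector: int, payload: bytes) -> None:
        self.write_store.write(sector, payload)
        if self.write_through:
            # The paper issues this concurrently; the sketch does it in sequence.
            self.backend_write(sector, payload)

    def read(self, sector: int) -> bytes:
        # Newest data first: write store, then snapshot (if any), then read store.
        for store in (self.write_store, self.snapshot_store, self.read_store):
            if store is not None:
                payload = store.read(sector)
                if payload is not None:
                    return payload
        # Miss in every layer: fetch from central storage and fill the read store.
        payload = self.backend_read(sector)
        self.read_store.write(sector, payload)
        return payload
```

In this sketch a second read of the same sector never reaches the backend, which is the effect the read store is meant to have on the centralized storage.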












Figure 9: Capo’s virtual disk cache components and
snapshot procedure.


The snapshot mechanism in the cache works in tan-
dem with a transactional update mechanism in the back-
end storage to ensure crash consistent updates to remote
disk images when operating in write-back mode. Push- 
ing updates to the backend storage involves three steps, 
as shown in Figure 9. First, a write cache snapshot is 
created by pausing the request stream momentarily and 
moving the contents of the cache’s write store to the 
snapshot store. Secondly, the snapshot contents are ap- 
plied transactionally to the centralized storage and to the 
read cache concurrently. Finally, after the snapshot up- 
dates have been applied, the snapshot store is cleared. 
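The three push steps can be sketched as follows. This is an illustrative model only: `WriteBackDisk` and its fields are hypothetical names, dictionaries stand in for the stores, pausing the request stream is not modeled, and the transactional update at the backend is approximated by building the new image and swapping it in with a single assignment.

```python
from typing import Dict


class WriteBackDisk:
    """Hypothetical write-back state for one virtual disk (sector -> data)."""

    def __init__(self, backend: Dict[int, bytes]) -> None:
        self.backend = backend
        self.write_store: Dict[int, bytes] = {}
        self.snapshot_store: Dict[int, bytes] = {}
        self.read_store: Dict[int, bytes] = {}

    def push_updates(self) -> int:
        # Step 1: with the request stream paused (not modeled here), move the
        # contents of the write store into the snapshot store.
        self.snapshot_store, self.write_store = self.write_store, {}

        # Step 2: apply the snapshot to central storage as one unit (modeled
        # by staging a new image and swapping it in) and to the read store.
        staged = dict(self.backend)
        staged.update(self.snapshot_store)
        self.backend = staged
        self.read_store.update(self.snapshot_store)

        # Step 3: once the updates have been applied, clear the snapshot store.
        pushed = len(self.snapshot_store)
        self.snapshot_store = {}
        return pushed
```

Because new writes land in a fresh write store after step 1, requests can resume while the snapshot is being applied.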

The cache is implemented as a user-level shared li- 
brary that interposes on I/O calls, specifically the glibc 
and libaio I/O and file management operations. Due to 
the relative sizes of the disks in virtual desktops (around 
10GB) and the disks in the physical machine which hosts 
them (greater than 1TB) and the amount of sharing be- 
tween virtual desktops (see Figure 6), we can easily sup- 
port hundreds of virtual desktops on a server without 
worrying about overfilling the cache. In our implemen- 
tation, when the cache does fill, we simply throw it away 
and start again. 


4.2 Multi-host Cache Preload


Capo’s local persistent cache goes a long way towards 
eliminating redundant read requests on individual ma- 
chines. But as growing VM deployments lead to larger 
numbers of physical hosts, redundant reads across these 
hosts place additional burden on central servers. Fur- 
ther scalability improvements can be attained in this case 
by multicasting common data to all hosts simultaneously 
rather than to each host individually. 

To this end, we have developed a multicast cache 
preloader for local caches. The preloader is completely 
lock-free and requires no modifications to existing cen- 
tralized servers. It consists of a service which observes 
network traffic to and from the central storage server. 
Clients on each host contact the service and register 
watches for files which are determined to be good can- 
didates for preloading. The service captures any reads 
made to these files and distributes the results to all sub- 
scribed clients via multicast. In this way, the first host to 
read common data essentially prefetches it for all other 
hosts. 


4.2.1 Design and Implementation 


Our initial design for the preload server was to use a mir- 
ror port on the central storage server to monitor network 
traffic. As in Ditto [5], our server captured raw network 
packets and reconstructed TCP flows to extract relevant 
data (in our case, NFS requests and responses). When 
deploying this solution, however, we observed signifi-
cant packet loss between the mirror port on the filer and
our server, and since a single packet loss is enough to cor-
rupt an entire NFS request or response, we were missing
many opportunities to preload data.

FAST ’11: 9th USENIX Conference on File and Storage Technologies

Our second, and current, design employs a user-level 
NFS proxy that sits between the clients and the filer. NFS 
requests and responses are routed through the proxy, and 
the proxy identifies data that should be preloaded into 
other local persistent caches. This increases the latency 
of filer requests somewhat, but avoids all of the issues 
with packet capture. 

In the current implementation, NFS clients are left 
unmodified. Instead, a single preload client runs on 
each physical host. On startup, these processes regis- 
ter watches with the server for files known to be shared 
across hosts. Because this data is predominately read- 
only, no synchronization is required when multicast 
clients update the local caches. When the preload server 
observes reads to files being watched by clients, it multi- 
casts the responses to all clients. 

Because NFS clients are unmodified, reads of shared 
files result in two responses: the unicast response to the 
original requester, and a second multicast response to all 
subscribed clients. This leads to an increase in over- 
all read bandwidth consumption from the proxy to the 
clients, but reduces the load on the storage server. The 
redundant unicast response could easily be avoided by 
making NFS clients aware of the multicast service. 

We also currently prioritize unicast responses over 
multicast responses. This limits the latency overhead 
seen by NFS clients while delaying preloading on other 
clients, making it slightly more likely that they will sub- 
mit unicast requests for the same data. With modified 
NFS clients, we could more viably prioritize multicast 
responses, improving the efficacy of preloading. 

The preload server sends a significant amount of traf- 
fic over a number of multicast sessions, and has exposed 
problems with the support for multicast in some mod- 
ern switches. On some of the switches that we have 
experimented against, multicast packets appear to con- 
sume a disproportionately large amount of resources. As 
a result, even relatively low-throughput multicast traf- 
fic has resulted in packet drops with detrimental conse- 
quences for concurrent TCP connections. The results can 
be dramatic: early experiments with completely unthrot- 
tled multicast traffic resulted in NFS throughput drops 
from 100MB/sec to 3MB/sec. 

To address this, we have implemented a rudimentary, 
adaptive flow-control protocol, similar to one described 
in SnowFlock [13]. Each packet sent by the server is as- 
sociated with an epoch. The server periodically updates 
the epoch number, and when clients notice a new epoch, 
they send a message indicating the number of packets 
successfully received during the previous epoch. In this 
way the server gets feedback about packet drop rates and
is able to vary transmission rates accordingly. An ad- 
ditive increase/multiplicative decrease scheme with ag- 
gressive back-off has produced reasonable results in our 
benchmarks. 

This flow-control protocol — and preloading in general 
— is strictly best-effort: no work is wasted trying to re- 
transmit dropped multicast packets. If the preload clients 
fail to receive multicast updates for required data, it will 
eventually be fetched via the conventional unicast path. 
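A minimal sketch of an epoch-based additive increase/multiplicative decrease controller is given below. The paper does not give the parameters of its protocol, so every constant here (initial rate, increment, halving on loss, the 1% loss tolerance) is an assumption, as are the class and method names. Consistent with the best-effort design, nothing is retransmitted; only the sending rate changes.

```python
from typing import List


class MulticastRateController:
    """Hypothetical AIMD rate controller driven by per-epoch receive counts."""

    def __init__(self, rate: float = 1000.0, min_rate: float = 100.0,
                 max_rate: float = 100000.0, increase: float = 100.0,
                 backoff: float = 0.5, loss_tolerance: float = 0.01) -> None:
        self.rate = rate                      # packets per second (assumed unit)
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.increase = increase              # additive step per clean epoch
        self.backoff = backoff                # multiplicative factor on loss
        self.loss_tolerance = loss_tolerance  # loss fraction treated as "clean"
        self.epoch = 0

    def end_epoch(self, sent: int, received_counts: List[int]) -> float:
        """Close the current epoch using the counts reported by the clients."""
        self.epoch += 1
        if sent > 0 and received_counts:
            # React to the worst receiver so that no client is overrun.
            worst = min(received_counts)
            loss = 1.0 - worst / sent
            if loss > self.loss_tolerance:
                self.rate = max(self.min_rate, self.rate * self.backoff)
            else:
                self.rate = min(self.max_rate, self.rate + self.increase)
        return self.rate
```

Keying the decision to the slowest receiver is one possible design choice; averaging the reports would back off less aggressively.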

The client logic for deciding which files to preload is 
simplified by a few basic design principles. We assume 
that, given a number of VMs derived from a common 
master image, reads of the base image made by any indi- 
vidual VM will likely be duplicated by all VMs. That is, 
while the disks belonging to derived VMs will tend to di- 
verge as the VMs age, the common portion of these disks 
will likely be read by all or none of the VMs. Thus if 
the multicast server observes any read of a common file, 
it is worth sending this data to all hosts on which Capo 
is caching this file. By the same assumption, multicast 
clients do not pro-actively request data from the server, 
as they are not in a position to know which portions of 
files will be read by VMs. 
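The watch-and-multicast behavior can be sketched as below. The names (`PreloadServer`, `register_watch`, `on_read_response`) and the example paths in the usage note are hypothetical, and network transmission is reduced to returning the list of sends in the order they would be issued, with the unicast response first, as in the current prioritization.

```python
from typing import Dict, List, Set, Tuple


class PreloadServer:
    """Hypothetical model of the preload service's watch table."""

    def __init__(self) -> None:
        self.watches: Dict[str, Set[str]] = {}  # file path -> subscribed hosts

    def register_watch(self, host: str, path: str) -> None:
        # Called by the per-host preload client on startup.
        self.watches.setdefault(path, set()).add(host)

    def on_read_response(self, requester: str, path: str) -> List[Tuple[str, List[str]]]:
        # The original requester always gets its normal unicast response first.
        sends: List[Tuple[str, List[str]]] = [("unicast", [requester])]
        subscribers = self.watches.get(path)
        if subscribers:
            # Any observed read of a watched file is worth sending to all
            # hosts caching that file; clients never request data themselves.
            sends.append(("multicast", sorted(subscribers)))
        return sends
```

For example, after two hosts register a watch on a shared base image, a read of that image by either host yields one unicast send followed by one multicast send, while a read of an unwatched file yields only the unicast send.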


4.3 Differential Durability 


Major VDI providers have all adopted the software up- 
date strategy proposed in The Collective [3], where user 
directories are isolated from the rest of the file sys- 
tem. Modifications made to files in the user directories 
must be durable; users depend on these changes. Capo 
therefore uses write-through caching on these directo- 
ries, propagating all changed blocks immediately to the 
centralized storage servers. Any modifications to the sys- 
tem image can then be performed on all VMs in one step 
by completely replacing the system images in the entire 
pool, leaving the user’s data unmodified. This impacts 
durability — any writes to the system portion (e.g., by up- 
dating the registry or installing software) will be lost. In 
this section we use and extend this notion to optimize for 
our write-heavy workload. 


4.3.1 Write-Back Period 


As mentioned in Section 2.2, VDI deployments man- 
age system data centrally, regularly replacing the system 
data seen by each virtual desktop with a clean updated 
version. While users are allowed to make changes to 
their system data, these changes are not guaranteed to 
be durable. Writes to the \Program Files directory as 
part of an application install process, for example, rep- 
resent work done by a user, but a software installation 
could easily be repeated if a failure caused this to be nec-
essary. It might be acceptable if the loss of such effort
was limited to, for example, an hour or even a day. We
can set our write-back period for such partially durable
files to a corresponding length of time.


Path                                                                      Policy
\Program Files\                                                           write-back
\WINDOWS\                                                                 write-back
\Users\ProgramData\VMware\VDM\logs                                        write-back
\Users\$USER$\ntuser.dat                                                  write-back
\Users\$USER$\AppData\local                                               write-back
\Users\$USER$\AppData\roaming                                             write-back
\pagefile.sys                                                             no-write-back
\ProgramData\Sophos                                                       no-write-back
\Temp\                                                                    no-write-back
\Users\$USER$\AppData\Local\Microsoft\Windows\Temporary Internet Files    no-write-back
Everything else, including user data and FileSystem metadata              write-through


Table 2: Sample cache-coherency policies applied as part of durability optimization.


4.3.2 Extending Partial Durability to User files 


While much of the data on the User volume is impor- 
tant to the user and must have maximum durability, Win- 
dows, in particular, places some files containing system 
data in the User volume. Examples include log files, 
the user portion of the Windows registry, and the local 
and roaming profiles containing per-application configu- 
ration settings. Table 2 shows some paths on User vol- 
umes in Windows that can reasonably be cached with a 
write-back policy and a relatively long write-back period. 


4.3.3 Eliminating Write-Back 


There are some system files that need not be durably 
stored at all. These include files that are discarded on sys- 
tem restarts or can easily be reconstructed if lost. Writes 
to the pagefile, for example, represent nearly a tenth of 
the total throughput to centralized storage in our work- 
load. These writes consume valuable storage and net- 
work bandwidth, but since the pagefile is discarded on 
system restart, durably storing this data provides no ben- 
efit. The additional durability obtained by transmitting 
these writes over a congested network to store them on 
highly redundant centralized storage provides no value 
because this data fate-shares with the local host machine 
and its disk. Many temporary files are used in the same 
way, requiring persistent storage only as long as the VM 
is running. 

We store this data to local disk only, assigning it a 
write-back cache policy with an infinitely long write- 
back period. In the event of a hardware crash on a phys-
ical host, the VM will be forced to reboot, and the data
can be discarded.


4.3.4 Design 


Initially, we approached the problem of mapping these 
policies to write requests as one of request tagging, in 
which a driver installed on each virtual desktop would 
provide hints to the local cache about each write. While 
this approach is flexible and powerful, maintaining the 
correct consistency between file and filesystem metadata 
(much of which appears as opaque writes to the Master 
File Table in NTFS) under different policies is challeng- 
ing. Instead, we have developed a simpler and better per- 
forming approach using existing filesystem features. 

The path-based policies we use in our experiments can 
be seen in Table 2; naturally, these may be customized by 
an administrator. We provide these policies to a disk op- 
timization tool that we run when creating a virtual ma- 
chine image. The optimization tool also takes a popu- 
lated and configured base disk image. For each of the two 
less-durable policies, it takes the given path and moves 
the existing data to one of two newly-created NTFS file 
systems dedicated to that policy. It then replaces the path 
in the original file system with a reparse point (Windows’
analogue of a symbolic link) to the migrated data. This
transforms the single file system into three file systems
with the same original logical view. Each of the three
file systems is placed on a volume with the appropriate
policy provided by the local disk cache. This technique 
is similar to the view synthesis in Ventana [18], though 
we are the first to apply the technique with a local cache 
to optimize performance. 
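To illustrate how the path-based policies of Table 2 partition a file system, the sketch below classifies a path by longest matching prefix. It is not the optimization tool itself, which relocates directories when the image is created rather than classifying individual requests. The function name, the case-insensitive comparison, and the longest-prefix rule are assumptions; the path list is transcribed from Table 2 with `$USER$` as a placeholder for the user name.

```python
from typing import List, Tuple

# (path prefix, policy) pairs transcribed from Table 2.
TABLE2_POLICIES: List[Tuple[str, str]] = [
    ("\\Program Files\\", "write-back"),
    ("\\WINDOWS\\", "write-back"),
    ("\\Users\\ProgramData\\VMware\\VDM\\logs", "write-back"),
    ("\\Users\\$USER$\\ntuser.dat", "write-back"),
    ("\\Users\\$USER$\\AppData\\local", "write-back"),
    ("\\Users\\$USER$\\AppData\\roaming", "write-back"),
    ("\\pagefile.sys", "no-write-back"),
    ("\\ProgramData\\Sophos", "no-write-back"),
    ("\\Temp\\", "no-write-back"),
    ("\\Users\\$USER$\\AppData\\Local\\Microsoft\\Windows\\Temporary Internet Files",
     "no-write-back"),
]
DEFAULT_POLICY = "write-through"  # "everything else" in Table 2


def policy_for(path: str, user: str) -> str:
    """Return the policy of the longest Table 2 prefix that contains `path`."""
    target = path.lower().rstrip("\\")
    best_len, best_policy = -1, DEFAULT_POLICY
    for prefix, policy in TABLE2_POLICIES:
        p = prefix.replace("$USER$", user).lower().rstrip("\\")
        # Match the entry itself or anything beneath it, on a path boundary.
        if (target == p or target.startswith(p + "\\")) and len(p) > best_len:
            best_len, best_policy = len(p), policy
    return best_policy
```

The longest-prefix rule matters for the Temporary Internet Files directory, which lies beneath the write-back `AppData\local` entry but carries the no-write-back policy.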

We appreciate that applying different consistency poli- 
cies to files in a single logical file system may be contro- 
versial. The risk in doing so is that a crash or hardware 
failure results in a dependency between a file that is pre- 
served and a file that is lost. Such a state could lead to 
instability; however, we are aware of no dependencies 
crossing from files with high durability requirements to
those with lower durability requirements in practice. Fur- 
ther, we observe that this threat already exists in the pro- 
duction environment we studied, which overwrites sys- 
tem images with a common shared image on a weekly 
basis. 


5 Evaluation 


To evaluate the effectiveness of Capo, we first consider 
how effective differential durability is at reducing write 
load from unimportant regions of disk. Next, we show 
the storage reduction achieved by Capo with eleven con- 
current users synthesizing active desktop workloads. Fi- 
nally we show the storage reduction achieved by Capo 
by replaying I/O logs gathered from a production system 
(see Section 3) under different caching policies. 


5.1 Differential Durability 


This section describes several microbenchmarks that 
evaluate the effectiveness of differential durability in iso-
lation from other features and provide a clearer mapping
of end-user activity to observed writes. We applied the 
policies in Table 2 to several realistic desktop workloads. 
For each, we measured the portion of write requests that 
would fall under each policy category. 
Figure 10: Percentage of writes in three microbench-
marks organized by governing cache-coherence policy.


5.1.1 Web Workload 


Our web workload is intended to capture a short burst of 
web activity. The user opens www.facebook.com with 
Microsoft Internet Explorer, logs in, and posts a brief 
message to their account. They then log off and close 
the browser. The entire task lasts less than a minute. The 
workload consists of 8MB (43.6% by count) of writes
and 25.3MB (56.4% by count) of reads. 

A breakdown of writes by their associated policy for 
each workload is shown in Figure 10. In this short work- 
load only a small, but non-trivial improvement can be 
made. Local configuration changes such as registry, temp 
file, and cache updates need not be written immediately, 
removing or delaying just over 20% of the operations. 


5.1.2 Email Workload 


Our email workload is based on Microsoft Outlook. The 
user sends emails to a server we have configured to au- 
tomatically reply to every message by sending back an 
identical message. Ten emails are sent and received in 
succession before the test ends. The workload consists 
of 63MB (39% by count) of writes and 148MB (61% by 
count) of reads. 

Here the improvement is much more substantial. Al- 
though very few writes can be stored to local disk in- 
definitely, over half can be delayed in writing to central- 
ized storage. This is due to Outlook’s caching behavior, 
which makes heavy use of the system and application 
data folders. Emails in the .pst file are included in the 
user category. It is worth noting that many files in the
Windows and application data folders are obvious temp files, but
did not match our current policies. With more careful 
tuning, the policies could be further optimized for this 
workload. 


5.1.3 Application Workload 


Our application workload is intended to simulate a sim- 
ple editing task. We open Microsoft Word and create a 
new document. We also open www.wikipedia.org in Mi-
crosoft Internet Explorer. We then proceed to navigate
to 10 random Wikipedia pages in turn, and copy the first
paragraph of each into our Word document, saving the
document each time. Finally, we close both programs. 
The workload consists of 120MB (20.0% by count) of 
writes and 406MB (80.0% by count) of reads. 

Viewing many small pages creates a large number of
small writes to temporary files, and memory pressure² in-
creases the pagefile usage. Both programs write signif-
icantly to system folders, leaving less than 36% of the
workload to be issued as write-through.

² The guest was running Windows Vista with 1GB of RAM, 25%
higher than the XenDesktop recommended minimum.


5.2 Multi-host Cache Preload


To evaluate the effectiveness of Capo’s multi-host
prefetching we boot three Windows XP VMs on three
different hosts. The experiment first fully boots one VM
before booting the other two VMs concurrently, with the
intention of demonstrating that the reads triggered by the
boot on the first host are sufficient to achieve a savings
for the later boots.

Figure 11 shows the read workload observed at 
the server in three different cache configurations: No 
cache, Write-through and Write-through with multi-host 
preload. Notice that the read workload for booting the 
two VMs is roughly double that of booting a single 
VM for both the no cache and write-through configura- 
tions. On the other hand write-through with multi-host 
prefetching almost eliminates the workload due to boot- 
ing the two later machines. 
Figure 11: Read IOps per second for booting three VMs
on three different hosts. 


5.3 Synthetic Workload


To evaluate Capo as a whole, we arranged to simulate 
a set of (very) active desktop users, performing similar 
workloads to those seen in the trace. Figure 13 shows 
the results of request traffic hitting both the local caches 
(in aggregate across all images) and the filer, while 11 
users actively use a variety of office and web-based ap- 
plications. 

First, note that the load in this case is higher than any 
of the peaks seen in the trace data. This workload rep- 
resents a higher level of aggregate storage activity than 
was ever seen in the production environment. Second, 
observe that despite being configured conservatively for 
complete write back, Capo reduces all peaks in the stor- 
age request load. 


5.4 Trace Replay 


To evaluate the benefits of deploying Capo in a real world 
setting, we replay the collected I/O traces (see Section 3) 
using different disk caching policies. The next sections 
describe our experimental setting, analyze our replayer 
fidelity, and present the replay results. 
Figure 13: IOps per second for a workload of 11 Win-
dows users on a XenCenter Cluster.


5.4.1 Experimental Setting 


The test environment consists of four physical machines 
which serve as hosts for the virtual machines that replay 
requests from the recorded trace, and a filer to serve as 
a backend storage for these virtual machines’ disks. The 
filer runs Linux’s default kernel NFS server to host an 
XFS volume built on top of a RAID 0 consisting of six 
disks. The host machines run XenServer 5.6 and store 
their local caches in an ext3 volume on top of a RAID 
configuration similar to that of the filer. The machines 
are connected using a 1Gb Ethernet switch. 

We replayed the workload of each desktop for which 
we had collected traces in a distinct virtual machine on 
one of the XenServer hosts. As it is impractical to replay 
the entire week’s trace for each configuration, we choose 
to focus on the six peak regions identified in Section 3. 

Entirely isolating our analysis to the peak regions 
would start each replay with an empty cache. Instead, 
we accurately recreated the state the cache would be in at 
the start of each region by priming it with the data from 
the whole trace up to that point. This includes any write-
back blocks that would have been pending. The write-
back interval was set to ten minutes for the write-back
and differential durability policies. 


5.4.2 Replay Fidelity 


Both the hosts and the storage server in our replay exper- 
iment are different from those in the original system from 
which we collected the traces. We satisfied ourselves that 
the replay is representative by measuring the observed 
load during a simple replay without any caching. Fig- 
ure 12 plots the fourth selected time period’s I/O oper- 
ations per second as observed at a number of different 
points in the I/O stack of the experimental environment. 

The Trace line represents the aggregate workload as 
observed in the original trace. The Replay line represents 
the rate at which the replayer issues I/O requests to the 
system as observed at the replay clients. These two lines
are almost indistinguishable in the figure, which indicates
that the timing of our replayer is accurate.

Figure 12: Replay fidelity and resulting load on the server.


Peak and total IOps / percentage of the corresponding No Cache value

Time     No Cache                        Write Through                   Write Back                      Differential Durability
Period   Peak            Total           Peak            Total           Peak            Total           Peak            Total
1        2307 / 100%     262590 / 100%   893 / 38%       155757 / 59%    670 / 29%       67798 / 25%     712 / 30%       91877 / 34%
2        2516 / 100%     561894 / 100%   937 / 37%       319936 / 56%    671 / 26%       113184 / 20%    903 / 35%       161737 / 28%
3        1302 / 100%     143468 / 100%   876 / 67%       126049 / 87%    595 / 45%       43455 / 30%     802 / 61%       84044 / 58%
4        1887 / 100%     450914 / 100%   910 / 48%       334089 / 74%    595 / 31%       131064 / 29%    849 / 44%       271529 / 60%
5        1214 / 100%     159736 / 100%   890 / 73%       141656 / 88%    704 / 57%       45868 / 28%     841 / 69%       75500 / 47%
6        2185 / 100%     72082 / 100%    1155 / 52%      66668 / 92%     910 / 41%       29086 / 40%     1368 / 62%      42895 / 59%
Average  100%            100%            52.5%           76%             38.1%           28.6%           50.1%           47.6%


Table 3: Peak and Total IOps workload observed at the file server during the replay of time periods of interest under
different caching policies. Each peak or total IOps value is followed by its ratio relative to its corresponding value
observed with no cache deployed. The last row represents the average reduction of the metric across the six time
periods of interest.

The Server line represents the load observed at the 
server. Notice that this load is lighter than the aggregate 
trace load, largely due to coalescing requests in the stor- 
age stack of the XenServer hosts. The VHD line repre- 
sents the load observed at the server when image files are 
stored in the Microsoft VHD format. Notice that VHD 
adds significant overhead to the workload; most of this 
overhead is due to metadata management.

We draw two observations from this evaluation. First, 
our replay client is able to match the request issue rate of 
the original trace with high fidelity. Second, because of 
transformations that result from both the XenServer stor- 
age stack and the underlying VM image format, the load 
experienced at the storage target may be dramatically dif- 
ferent from that measured at the client. In evaluating our 
cache under replay in the next subsection, we first replay 
with no caching involved to establish a baseline load at 
the filer, and then compare caching configurations to this 
baseline. 


5.4.3 Replay Results 


We replayed the 6 periods of intense workload identi- 
fied in Figure 2 using four different cache configurations. 
These cache configurations are no cache, write-through, 
write-back and differential. Figure 14 plots the IOps ob-
served at the server using each configuration. As ex-
pected, differential durability represents a compromise
between the load reduction realized by write-back and 
completely protecting important user data. 

Table 3 summarizes the peak and total I/O workload 
reductions for the time periods of interest. The write- 
back policy applied to the entire disk was the best in re- 
ducing I/O workload. On average it reduced the peak 
and total I/O workload to 38.1% and 28.6% of that with- 
out any caching in place. The differential durability pol- 
icy goes further to protect user files, and still reduced the 
peak and total I/O workload down to 50.1% and 47.6% 
on average. Finally, as expected the write-through policy 
had the worst average peak and total workload reductions 
of 52.5% and 76%. 
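The averages quoted above can be checked against the per-period percentages printed in Table 3. The short sketch below does this arithmetic. The dictionary layout and function name are illustrative, and it assumes that the four detached entries in the extracted table (61%, 44%, 69%, 62%) are the differential-durability peak ratios for periods 3 to 6, which matches their IOps values relative to the no-cache peaks.

```python
from typing import Dict, List

# Per-period ratios (percent of the No Cache value) as printed in Table 3.
TABLE3_PERCENT: Dict[str, Dict[str, List[int]]] = {
    "write-through": {"peak": [38, 37, 67, 48, 73, 52],
                      "total": [59, 56, 87, 74, 88, 92]},
    "write-back": {"peak": [29, 26, 45, 31, 57, 41],
                   "total": [25, 20, 30, 29, 28, 40]},
    # Peak values for periods 3-6 are an assumed assignment (see lead-in).
    "differential": {"peak": [30, 35, 61, 44, 69, 62],
                     "total": [34, 28, 58, 60, 47, 59]},
}


def average_ratio(percentages: List[int]) -> float:
    """Arithmetic mean of the per-period ratios, in percent."""
    return sum(percentages) / len(percentages)


if __name__ == "__main__":
    for policy, metrics in TABLE3_PERCENT.items():
        peak = average_ratio(metrics["peak"])
        total = average_ratio(metrics["total"])
        print(f"{policy}: peak {peak:.2f}%, total {total:.2f}%")
```

The write-through means come out at exactly 52.5% and 76%. The other four means fall within a tenth of a percentage point of the reported values, a gap that would be consistent with the reported averages having been computed from unrounded ratios.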


6 Related Work 


The caching component of Capo is most closely related 
to the ITC [19], Andrew [10], and Coda [12] file systems 
which utilize the local disk as a cache for whole files 
retrieved from servers. The Cedar file system [21] allows 
users to share immutable files over the network; by only 
supporting immutable files Cedar eliminates the need for 
cache consistency management. 

Unlike these distributed file system caches, Capo op-
erates at the block level. Cache consistency management
is simplified by the fact that each virtual disk has a sin-
gle writer and copy-on-write is used to prevent updat-
ing shared data. As in Cedar, shared data is always im-
mutable.

Figure 14: IOps per second observed at the filer for replays of selected periods of interest under different cache
configurations.

Fs-cache [11] and iCache [9] are systems that, like 
Capo, implement block-level caching for remote stor- 
age systems: a file system in the case of Fs-cache and 
an iSCSI target in the case of iCache. Capo extends 
the basic block caches of these systems using a host 
cache shared by all the VMs on a host, the multi-host 
prefetcher, and differential durability for files. All of 
these features are inspired by our target environment of 
supporting virtual desktops. 

Capo’s use of write-back caching reduces the demand 
placed on the central storage facility in a manner sim- 
ilar to that of Everest [15]. Where Everest replicates 
offloaded write requests to tolerate disk failures, Capo 
uses a technique similar to Snapmirror [16] to periodi- 
cally push self-consistent updates across the network for 
data that is cached in write-back mode. 

Other researchers have studied the performance of 
storage in virtualized environments. In particular, Gu- 
lati et al. [7] study the storage demands of enterprise ap- 
plications in virtualized environments. In contrast, our 
study of virtual desktops provides insight into the unique 
characteristics of this emerging use of virtualization. 

SnowFlock [13] provides a fork abstraction to instan- 
taneously replicate stateful virtual machines to scale up 
computations in the cloud easily. Similar to our multi- 
host cache preloader, SnowFlock uses multicasting to 
replicate the persistent (disk) and non-persistent (mem- 
ory) state of the cloned virtual machines. 

Agrawal et al. [1] and Bolosky et al. [2] collect and an-
alyze snapshots of desktop machines’ file system meta-
data over long periods of time. This kind of analysis
restricts I/O workload analysis to mean estimates and
doesn’t capture the dynamic characteristics of desktop
I/O such as burstiness. In this work we focus on captur-
ing detailed block level I/O operations to better under-
stand the variation of desktop I/O workloads in time.

Lithium [8] gives up centralization in favor of distri- 
bution to provide scalable storage for virtual machines. 
To improve availability of data, Lithium replicates disk 
updates to remote hosts either synchronously or lazily 
(eventual consistency). These two replication policies 
are analogous to Capo’s write-through and write-back
caching policies. However, Lithium’s treatment of repli- 
cation consistency is more complicated due to its dis- 
tributed nature. 


7 Conclusion 


Enterprise storage provides considerable benefit to vir- 
tual environments. However, for applications such as 
virtual desktops, which involve large numbers of nearly
identical images running concurrently, a large portion of 
the request load placed on shared storage is unnecessary. 
After analyzing a one-week trace of a production VDI 
deployment, we presented Capo, a distributed and per- 
sistent cache which reduces the aggregate load placed on 
shared storage. Capo uses local disks on individual phys- 
ical servers to cache image contents for the VMs being 
hosted. It includes mechanisms to share common cached 
base images across VMs, and to prefetch caches across 
physical hosts. In addition, Capo supports a configurable 
degree of differential durability, allowing administrators 
to relax the durability properties and the associated write 
load of less-important subsets of a VM’s file system. 
Abstract


This work analyzes the stochastic behavior of writing to 
embedded flash memory at voltages lower than recom- 
mended by a microcontroller’s specifications to reduce 
energy consumption. Flash memory integrated within a 
microcontroller typically requires the entire chip to op-
erate on a common supply voltage almost double what the
CPU portion requires. Our approach tolerates a lower 
supply voltage so that the CPU may operate in a more en- 
ergy efficient manner. Energy efficient coding algorithms 
then cope with flash memory that behaves unpredictably. 
Our software-only coding algorithms (in-place writes, 
multiple-place writes, RS-Berger codes) enable reliable 
storage at low voltages on unmodified hardware by ex- 
ploiting the electrically cumulative nature of half-written 
data in write-once bits. For a sensor monitoring applica- 
tion using the MSP430, coding with in-place writes re- 
duces the overall energy consumption by 34%. In-place 
writes are competitive when the time spent on computa- 
tion is at least four times greater than the time spent on 
writes to flash memory. Our evaluation shows that tightly 
maintaining the digital abstraction for storage in embed- 
ded flash memory comes at a significant cost to energy 
consumption with minimal gain in reliability. 


1 Introduction 


Billions of microcontrollers appear in embedded systems 
ranging from thermostats and utility meters to tollway 
payment transponders and pacemakers¹. Recent years
have witnessed a proliferation of low-power embedded 
devices [2, 7, 17, 21], many of which use on-chip flash 
memory for storage. 

While the reliability, low cost, and high storage den- 
sity of flash memory make it a natural choice for embed- 
ded systems [15], its relatively high voltage requirement 
(Table 1) introduces challenges for energy-efficient de- 


¹A single manufacturer claims to have shipped over 8 billion mi- 
crocontrollers: http://www.microchip.com/sec/annual/FY10/. 


signs aiming to maximize the system’s effective lifetime 
(e.g., the run time on a typical battery whose voltage 
declines over time). Instrumenting the system to oper- 
ate at a fixed low voltage v_l is one way to reduce power 
consumption; however, consistently correct re- 
sults for flash writes are guaranteed only if v_l is higher 
than a manufacturer-specified threshold. Moreover, in 
energy-limited devices that cannot provide a constant 
supply voltage, scenarios may arise in which the flash 
memory is the only part of the circuit whose operating 
requirements are not met. In such cases, applications can 
expect normal operation when they are not performing 
flash writes and unpredictable behavior when they are. 














Microcontroller    CPU min. voltage    Flash write min. voltage
TI MSP430 [36]     1.8 V               2.2 or 2.7 V
PIC32M [24]        2.3 V               3.0 V
ATmega128L [3]     2.7 V               4.5 V














Table 1: Flash memory restricts choices for the CPU 
voltage supply on microcontrollers because the CPU 
shares the same power rail as the on-chip flash memory. 


Because embedded flash memory typically shares a 
common voltage supply with the CPU (separate power 
rails are cost prohibitive), a single voltage must be cho- 
sen that satisfies different components with different 
minimum voltage requirements. Current embedded sys- 
tems address the voltage limitations of flash memory in 
one of the following ways: 

i) A system can choose a high supply voltage suffi- 
cient for both reliable writes to flash memory and reliable 
CPU operation. This is a common choice for embed- 
ded systems with on-chip flash memory, but causes the 
CPU to consume more energy than necessary. For exam- 
ple, the TI MSP430F2131 microcontroller [36] in active 
mode consumes almost double the power when operat- 
ing at 2.2 V instead of 1.8 V. Its onboard flash memory 
requires 2.2 V for reliable writes. 


Figure 1: Operating at a lower voltage and tolerat- 
ing errors instead of the conventional case of choos- 
ing the highest minimum voltage requirement may 
help decrease energy consumption. Considering that 
Energy = voltage² × time / resistance, decreasing volt- 
age decreases the energy consumption quadratically. 

ii) A system can choose a low supply voltage sufficient 
for CPU operation, but insufficient for reliable writes to 
flash memory. This choice allows the energy source to 
last longer and the CPU to compute more efficiently. 
An example of such a system is the Intel WISP [33], 
a batteryless RFID tag that sets its operating voltage to 
1.8 V—below its onboard flash memory’s 2.2 V spec- 
ified minimum—to save power. Flash memory cannot 
be written on this device. The microcontroller could use 
a low-power wireless interface (e.g., RF backscatter) to 
store data remotely. Such an approach, however, raises 
privacy as well as performance concerns [32]. 

iii) A system can modify hardware to enable dy- 
namic voltage scaling. This approach requires additional 
analog circuitry such as voltage regulators and GPIO- 
controlled switches. Because many embedded systems 
are extremely cost sensitive, this choice is unattractive 
for high-volume manufacturing with low per-unit profit 
margins. An additional 50-cent part on a thermostat con- 
trol can be cost prohibitive. Moreover, small changes 
may necessitate a new PCB layout—upsetting the deli- 
cate supply chain and invalidating stocked inventories of 
already fabricated PCBs. 


Approach. Our approach reduces the operating volt- 
age of the microcontroller to a point at which the result- 
ing energy savings of the CPU portion of the workload 
exceeds the energy cost of the algorithms for ensuring 
reliable writes (Figure 1). The technique requires min- 
imal or no hardware modification and also allows 
RFID-scale devices to better exploit capacitors as power 
supplies. A capacitor stores finite energy, so its voltage 
decays exponentially as it discharges. The long tail of 
the curve provides insufficient voltage for conventional 
writes to flash memory, but is sufficient for reliable stor- 
age with our techniques. 
Of wits and half-wits. In 1982, Rivest and Shamir in- 
troduced the notion of write-once bits (wits) in the con- 
text of coding theory to make write-once storage behave 
like read-write storage [31]. Bits in flash memory be- 
have like wits because a programmed bit cannot be re- 
programmed without invoking an energy-intensive erase 
operation on a block of memory much larger than a sin- 
gle write. We coin the term half-wits to refer to wits used 
in a manner inconsistent with a manufacturer’s specifica- 
tions, resulting in stochastic behavior. Half-wits in this 
work are wits of flash memory used below the recom- 
mended supply voltage. 

In examining error rates at low voltage and construct- 
ing a system that provides reliable storage despite errors, 
our work suggests that it is appropriate to relax previ- 
ously assumed constraints and reexamine the costly dig- 
ital abstractions layered above on-chip flash memory. 


Contributions. Our primary contributions include an 
empirical evaluation that characterizes the behavior of 
on-chip flash memory at voltages below minimum lev- 
els specified by manufacturers, and algorithms that en- 
able reliable writes to flash memory while coping with 
low voltage. Our evaluation identifies three key factors 
affecting error rates: voltage, Hamming weight of the 
data, and the wear-out history of the flash memory. 

The first algorithm, in-place writes, repeatedly writes 
data to the same memory address until the value is 
stored correctly or an attempt limit is reached. The 
intuition behind this approach is that repeating a write 
attempt in a consistent location accumulates charge in 
the same cell, increasing the chance of storing a bit of 
information correctly. In addition, since flash writes 
only change bits in a single direction, a correctly written 
bit cannot be reversed to produce an error on a second 
write attempt. The second algorithm, multiple-place 
writes, tries to decrease the probability of error by 
making attempts at both write time and read time. This 
method stores data in more than one location, with the 
aim that each bit will be stored correctly in at least one 
of these locations. The 
third algorithm is a hybrid error-correcting code combin- 
ing Reed-Solomon (RS) [29] and Berger [5] codes. The 
Berger code detects, but does not correct, asymmetric er- 
rors caused by the low write voltage. Given the approx- 
imate locations of errors, which are determined by the 
Berger code, the RS code efficiently recovers the origi- 
nally stored data. 

The paper compares all three methods in terms of en- 
ergy consumption, execution time, and error correction 
rate. We also show that our methods are most effective 
for CPU-bound workloads. With respect to cost and en- 
ergy, our techniques may enable already deployed em- 
bedded flash memory to remain competitive with emerg- 
ing technology for low-power, non-volatile memory. 
Figure 2: As operating voltage decreases, flash-write errors increase. (a) shows an original ECG signal correctly 
stored at 2.0 V (despite operating below the recommended threshold). As the voltage decreases in (b) and further 
in (c), erroneous writes (light-colored spikes, height varying according to the magnitude of the error) become more 
common. The black line shows the reconstructed signal that includes the errors. 


2 Behavior of Storage on Half-wits 


Before we can design effective coding algorithms, we 
must first understand the behavior of errors in half-wits. 
By tolerating a lower voltage, an energy-limited em- 
bedded device can decrease its power consumption and 
therefore extend its lifetime on a finite energy supply². 
The minimum operating voltage of embedded devices 
that use nonvolatile on-chip storage is usually deter- 
mined by the requirements of flash memory. For exam- 
ple, the TI MSP430 microcontroller can operate at 1.8 V, 
but its nominal minimum voltage for flash writing and 
erasure is 2.2 V (Table 1). Because power grows with the 
square of the supply voltage, increasing operating voltage 
from 1.8 V to 2.2 V causes the CPU to draw about 50% 
more power without a commensurate gain in clock speed. 
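
The 50% figure can be checked with a short calculation. This is a minimal sketch that assumes the CPU behaves like a fixed resistive load at a fixed clock rate, so that power is proportional to the square of the supply voltage; the function name `relative_power` is illustrative only.

```python
def relative_power(v_high, v_low):
    """Ratio of power drawn at v_high to power drawn at v_low.

    Assumes P = V^2 / R with the same effective resistance R at both
    voltages (a simplification, not a measured model).
    """
    return (v_high / v_low) ** 2


ratio = relative_power(2.2, 1.8)
print(f"power at 2.2 V relative to 1.8 V: {ratio:.2f}x")  # about 1.49x
```

Under this assumption the increase is roughly 49%, in line with the "about 50%" stated above.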

The drawback of lowering voltage below flash mem- 
ory requirements in order to save power is the loss of 
flash memory reliability. Figure 2 shows the result of 
running an MSP430F2131 at three different voltages, 
all lower than the nominal minimum for flash writes, 
to store electrocardiogram (ECG) data samples from the 
PhysioNet database [13] in flash memory. Many medical 
sensor networks [20, 22, 34] that provide ECG measure- 
ments are energy limited and use on-chip flash memory 
as primary storage. 

These graphs support the intuition that flash writes 
may not be error free at low voltages and that there exist 
voltage levels below the minimum recommended voltage 
at which flash writes function correctly³. To investigate 
the behavior of flash memory at low voltage and deter- 
mine the factors influencing the error rate, we performed 
experiments on an automated testbed of our own design. 


²Or, because of relaxed requirements, eliminate the need for multi- 
ple batteries in series to achieve a high voltage. 

³Moreover, a nonzero error rate may be tolerable for some appli- 
cations. In the case of ECG data, the cardiac pulse interval can be 
recovered from noisy data stored at low voltage. 


2.1 Experimental Methodology 


We use a consistent experimental setup for all of the ex- 
periments in this work. Our choice of test platform is a TI 
MSP430 [36] microcontroller with on-chip flash mem- 
ory. More specifically, we tested two types of TI mi- 
crocontrollers: MSP430F2131 and MSP430F1232. The 
MSP430 is common in low-power embedded applica- 
tions; we note especially its use in sensor motes [28] 
and RFID-scale batteryless devices [33]. In our setup, 
an MSP430 microcontroller runs a test program that in- 
volves both computation and flash operations. We power 
the microcontroller with an external power supply held 
steady at a voltage below the nominal minimum for flash 
writes. An external chip captures the contents of flash 
memory after each experiment. 

To automate the testing of flash write behavior, we 
have developed a flash memory testbed. The two major 
components of the testbed are a test platform and a con- 
nected monitoring platform. The monitoring platform is 
based on an additional MSP430 microcontroller. The test 
platform runs a test program at low voltage. When the 
test program completes, the test platform sends the result 
of the experiment to the monitoring chip via GPIO pins. 
The test and monitoring platforms share 8+1 GPIO pins 
to carry one byte of data and a clock signal. Once the 
test platform puts data on its eight data pins, it raises the 
clock pin. The monitoring chip reads data from its GPIO 
pins whenever it detects a rising clock signal and logs 
the results in its own flash memory. The monitoring chip 
runs at a voltage above the nominal minimum for its own 
flash writes, and therefore stores reliably. 
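
The transfer between the two platforms can be illustrated with a small software simulation. This is a sketch only: the class and function names are hypothetical, and lowering the clock between bytes is an assumption, since the text specifies only that the monitor samples on a rising clock edge.

```python
class SharedPins:
    """Eight data lines plus one clock line shared by the two platforms."""

    def __init__(self):
        self.data = 0   # value currently driven on the eight data pins
        self.clock = 0  # level of the clock pin


class Monitor:
    """Monitoring chip: logs the data pins on each rising clock edge."""

    def __init__(self, pins):
        self.pins = pins
        self.prev_clock = 0
        self.log = []   # stands in for the monitor's own flash memory

    def poll(self):
        # Latch the data pins only on a 0 -> 1 clock transition.
        if self.pins.clock == 1 and self.prev_clock == 0:
            self.log.append(self.pins.data)
        self.prev_clock = self.pins.clock


def send_bytes(pins, monitor, payload):
    """Test platform: put each byte on the data pins, then raise the clock."""
    for byte in payload:
        pins.data = byte & 0xFF
        pins.clock = 1
        monitor.poll()
        pins.clock = 0  # assumed: clock is lowered before the next byte
        monitor.poll()


pins = SharedPins()
monitor = Monitor(pins)
send_bytes(pins, monitor, [0x00, 0x7F, 0xFF])
print(monitor.log)  # [0, 127, 255]
```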


2.2 Unreliable, Low-Voltage Flash Memory Writes 


The TI MSP430 datasheet [36] states that flash writes 
at any voltage lower than the nominal minimum volt- 
age (which is 2.2 V in the case of MSP430F2131) are 
not guaranteed to succeed. However, as the graphs in 
Figure 2 show, not all flash writes fail at low voltages. 
On the contrary, in this specific experiment, most of the 
writes (95.24% at 1.9 V and 89.88% at 1.8 V) succeed. 

In a NOR flash memory, all cells are initialized to 1, 
and writing data to a byte of flash memory means setting 
an appropriate number of bits to 0 by applying electri- 
cal charge to the corresponding flash cells. At low volt- 
age, there may be insufficient charge to effect a transi- 
tion to 0, and a flash write may store fewer 0 bits than 
requested [27]. To be specific, we define errors as fol- 
lows: when a byte of data d_1 is written to a flash memory 
address and then data d_2 is read from that address, there 
is an error if d_1 ≠ d_2. An experiment, explained next, in- 
vestigates the behavior of low-voltage flash memory and 
gives bit-level results. 

Using the automated flash testbed explained in Section 
2.1, the test platform runs a program that writes numbers 
{0, ..., 255} to flash memory, then sends the contents of 
its flash memory to the monitoring platform via GPIO 
pins. Table 2 compares the written data and the intended 
data for cases in which errors occurred. It demonstrates 
that, when both are represented as integers, the absolute 
value of the stored data is always greater than or equal to 
the absolute value of the intended data. 
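
This ordering follows from the one-directional nature of the errors: a failed write leaves a bit at 1, so every 1 bit of the intended byte is also 1 in the stored byte. The check below is a small sketch using the six intended/written pairs of Table 2; the second written value is taken as 01011111, which matches the listed Hamming distance of 3.

```python
# (intended, written) byte pairs taken from Table 2.
pairs = [
    (0b00001100, 0b11101101),
    (0b00001101, 0b01011111),
    (0b00001110, 0b11111111),
    (0b00010100, 0b11111111),
    (0b00100111, 0b00101111),
    (0b10100100, 0b10101111),
]


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


for intended, written in pairs:
    # Only 0 bits can be wrong: the 1 bits of `intended` survive intact.
    assert intended & written == intended
    # Consequently the stored value is never smaller than the intended one.
    assert written >= intended
    print(f"{intended:08b} -> {written:08b}  "
          f"distance {hamming_distance(intended, written)}")
```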


2.3 Determining Factors That Affect Error Rates 


We consider the following potential factors that may af- 
fect the error rate of setting a bit to 0 in a flash memory 
at low voltage: voltage level, Hamming weight of the 
data, wear-out history, permutation of 0s, and neighbor 
cells. The effects of each of these variables are evalu- 
ated by designing an experiment to test a hypothesis. All 
the experiments are performed on flash memories with 
minimal previous usage unless stated otherwise. 

Voltage level: Our hypothesis is that the lower a chip’s 
operating voltage (and that of its on-chip flash memory), 
the higher the error rate of flash writes. Figure 3 confirms 
this hypothesis; moreover, the graph shows that for dif- 
ferent chips of exactly the same type, the error rate can 
be different even under equivalent voltage. 

Experiment: Two MSP430F2131 and two 
MSP430F1232 microcontrollers run a program that 
writes zeros to the data segment of their flash memory. 
We increased the microcontroller’s operating voltage 
in 10 mV steps, and used the monitoring platform to 
compute the byte error rates over 50 runs. 

Hamming weight: In an erased (i.e., having value 1) 
flash cell, writing a 1 is always error free because no 
change to the cell is necessary. However, setting a cell to 
0 might fail if there is not enough charge accumulated in 
that cell. Our hypothesis is that, the lower the Hamming 
weight (number of 1s in the binary representation) of a 
number, the higher the probability of error when writing 
that number to flash at low voltage. 
Figure 3: Flash write error rates decrease as volt- 
age increases. This trend holds for all the chips 
(MSP430F2131 and MSP430F1232) we tested, though 
error rates differ even between chips of the same model. 


Based on per-byte Hamming weight, there are nine 
equivalence classes of integers that can be represented in 
one byte. The weight-8 equivalence class has only one 
member, 255, which can always be written to an erased 
flash cell without error. The other extreme case is the 
weight-0 equivalence class, containing only 0, which re- 
quires all eight bits to transition to 0. Figure 4 shows the 
byte error rate for all nine equivalence classes, measured 
via the following experiment. 


[Figure 4 plot: error rate (%) versus Hamming weight.]


Figure 4: As the Hamming weight (number of 1s in the 
binary representation) of a number increases, the error 
rate of low-voltage flash writes declines. The data corre- 
spond to an MSP430F2131 running at 1.84 V. 


Experiment: At 1.84 V, an MSP430F2131 runs a pro- 
gram that writes numbers from the same equivalence 
class to one block (64 bytes) of flash memory. We used 
the monitoring platform to compute the average byte er- 
ror rate of flash writes for each of the nine equivalence 
classes over 50 runs. 
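
The nine equivalence classes used in this experiment can be enumerated directly. The snippet below is an illustrative sketch (the variable names are not from the original work) that groups all one-byte values by Hamming weight.

```python
from collections import defaultdict

# Map each Hamming weight (0..8) to the byte values having that weight.
classes = defaultdict(list)
for value in range(256):
    weight = bin(value).count("1")  # number of 1 bits in the byte
    classes[weight].append(value)

for weight in sorted(classes):
    print(f"weight {weight}: {len(classes[weight])} values")
```

The weight-8 class contains only 255 and the weight-0 class only 0, matching the two extreme cases described above.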

Corollary: To exploit the fact that the Hamming 
weight of a number affects probability of error when it 
is written to flash, one can transform numbers into num- 
bers with greater Hamming weights before writing them 
to flash memory. 
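
One possible transformation of this kind is conditional inversion. This is a hypothetical sketch, not a scheme taken from the evaluation above: a byte with fewer than four 1s is stored as its bitwise complement, and a one-bit flag records whether inversion took place. The flag is chosen so that the common "not inverted" case is a 1, which needs no programming.

```python
def weight(byte):
    """Hamming weight (number of 1 bits) of a one-byte value."""
    return bin(byte & 0xFF).count("1")


def encode(byte):
    """Return (stored_byte, flag); flag 1 = stored as is, flag 0 = inverted."""
    if weight(byte) < 4:
        return (~byte) & 0xFF, 0
    return byte & 0xFF, 1


def decode(stored, flag):
    """Undo the conditional inversion."""
    return stored if flag == 1 else (~stored) & 0xFF


stored, flag = encode(0x00)
print(f"{stored:08b} flag={flag}")  # 11111111 flag=0
```

With this transformation every stored byte has a Hamming weight of at least four, at the cost of one extra flag bit per byte.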

Wear-out history: Flash memory has a limited life- 
time (about 10^5 cycles of erasures) after which the erase 
operations fail to reliably reset the bits to 1. We sus- 
pect that the more a flash memory is erased (worn out), the 
lower its error rate of setting bits to 0 becomes⁴. 


(Binary)
Intended         | 00001100 | 00001101 | 00001110 | 00010100 | 00100111 | 10100100
Written          | 11101101 | 01011111 | 11111111 | 11111111 | 00101111 | 10101111
Hamming distance |    4     |    3     |    5     |    6     |    1     |    3

Table 2: Erroneous flash writes at low voltage. Insufficient electrical charge may result in some bits failing to transition 
from 1 (the initial state) to 0. 


Figure 5 shows a heat map of bit error rate for three 
blocks of flash memory (192 bytes) on an MSP430F2131 
microprocessor. Lighter colors in the heat map represent 
higher error rates. The disproportionately dark color of 
the middle block is due to more frequent erasure of that 
block compared to the other two blocks. 
Figure 5: Worn-out flash memory blocks are biased to- 
ward ease of writing zeros. Lighter color represents 
higher average number of errors over 50 trials. The mid- 
dle block has been write/erase cycled 6,000 times. The 
other two blocks are minimally used. 


Experiment: An MSP430F2131 runs a program that 
writes zeros to all three blocks of its flash memory. The 
MSP430 is first worn out such that one block has 6,000 
write/erase cycles and two blocks have minimal previous 
usage. We used the monitoring platform to compute the 
average error rate for all bits in the three blocks of mem- 
ory over 50 trials. 

Corollary: Wear-out history affects error rate, so stor- 
ing data in more than one location might help decrease 
the error rate, especially if those locations are in different 
blocks of memory. 

Permutation of 0s: Two numbers belonging to the 
same Hamming-weight equivalence class can have dif- 
ferent permutations of 0 bits. We tested whether the er- 
ror rate depends on the permutation of 0s in one byte 
of data. For example, the numbers 240, 15, 170, and 
71 all have four 0s in their binary representation but in 
different places (240 has its 0s in the right nibble, 15 
has all of its 0s in its left nibble, etc.). The result of 
the experiment shows a similar byte error rate, with a mean 
of 39.85 ± 4.29%, for numbers in the same equivalence 
class. The small standard deviation (4.29%) shows that 
the permutation of 0s does not significantly affect the er- 
ror rate, and therefore we do not consider this factor in 
our design directions. 


⁴This counterintuitive hypothesis is consistent with the notion that 
flash erasures (setting bits to 1) become harder with wear out. 

Experiment: An MSP430F2131 runs a program that cy- 
cles through eight numbers from the same Hamming- 
weight equivalence class, writing them to 192 consec- 
utive bytes of flash memory. We used the monitoring 
platform to compute the average error rates for each of 
the 192 bytes over 50 trials. 

Neighbor cells: Another factor that might affect the 
error rate of storage in a flash cell at low voltage is the 
values of neighboring cells. However, our results suggest 
that a cell’s error rate does not appear to depend on the 
values stored in neighboring cells (Figure 6). 
Figure 6: Error rate of a cell is not noticeably influ- 
enced by the value of its neighbor. The graph shows that 
the value of the second LSB does not greatly affect the 
error rate of the LSB. The bars show the error rate of 
the LSB for writing numbers from the same Hamming- 
weight equivalence class whose two LSBs are set to ei- 
ther 00 (dark bars) or to 10 (light bars). 


Experiment: In order to determine if the error rate of 
a cell is affected by its neighbor, we consider all num- 
bers from the same Hamming-weight equivalence class 
whose two Least Significant Bits (LSBs) are set to either 
00 (case 1) or 10 (case 2). An example of case 1 is num- 
ber 60 (0b00111100) and an example of case 2 is number 
30 (0b00011110). This experiment fixes the Hamming 
weight variable and changes the neighbor value of the 
LSB to be 0 or 1. We deem a write erroneous if the LSB 
is not set to 0. The experiment was done for a Hamming 
weight of four and it was repeated for five voltage levels 
in the interval of 1.82 V to 1.84 V with steps of 5 mV. 
The error rate for any voltage above 1.84 V was close to 
0% and for any voltage below 1.82 V was close to 100%. 
We used the monitoring platform to compute the average 
error rates of case 1 and case 2 for each of the voltage 
levels over 50 trials. 


2.4 Accumulative Memory Behavior 


It is helpful to understand a few details of the electri- 
cal nature of flash memory in order to appreciate the 
expected behavior of conventional digital abstractions 
when layered above embedded flash memory. Each flash 
memory cell is a floating-gate (FG) transistor made up 
of a source, drain, control gate, and floating gate. The 
floating gate is separated from the source and drain by an 
insulating oxide layer that makes it difficult for electrons 
to travel into or out of the gate. Flash cells rely on this 
oxide to maintain logical state in the absence of power, 
making the memory non-volatile [27]. 

To write a memory cell (which has an erased value of 
1), the control circuitry applies a high field to the source. 
The application of this field greatly increases the proba- 
bility that electrons in the floating gate will tunnel to the 
source. If a sufficient number of electrons tunnel to the 
source, the cell is subsequently read as a 0. To erase a 
cell (restoring a 1), the control circuitry applies a high 
field to both the source and drain. This field energizes 
the electrons currently stored near the source, allowing 
them to jump the oxide barrier to the floating gate where 
they are once again trapped [27]. 

Not all electrons must transition in order for a write 
or erase operation to be successful. The operation only 
needs to change the state of some majority of the elec- 
trons so that subsequent read operations detect sufficient 
charge to discern the intended value. Lowering the ap- 
plied voltage (and thus the field strength) lowers the 
probability of state change for each electron but, as noted 
earlier, electrons that do transition will remain in place. 

A low-power storage scheme can benefit from this ac- 
cumulative property by repeating writes to the same cell. 
Each write operation will increase the chance of success 
by forcing some number of state transitions. In other 
words, a failed write is still progress. 
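
The benefit of repetition can be quantified with a deliberately simple model. The sketch below assumes that each write attempt programs a given still-unprogrammed bit independently with a fixed probability `p_single`; this independence assumption and the function names are illustrative and are not measurements from the testbed.

```python
def p_bit_stored(p_single, attempts):
    """Probability that one 0 bit is programmed within `attempts` writes.

    A programmed bit stays programmed, so the bit fails only if every
    attempt fails.
    """
    return 1.0 - (1.0 - p_single) ** attempts


def p_byte_stored(p_single, attempts, zeros):
    """Probability that all `zeros` 0 bits of a byte are programmed."""
    return p_bit_stored(p_single, attempts) ** zeros


for attempts in (1, 2, 4, 8):
    print(attempts, round(p_byte_stored(0.5, attempts, zeros=4), 4))
```

Under this model the success probability rises monotonically with the number of attempts, and bytes with fewer 0 bits succeed sooner.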


3 Design of a Low-Voltage Storage 


This section presents our design for a software system 
that enables reliable flash memory writes at low voltage. 
We first present a model that captures the basic character- 
istics and behavior of flash memory. We then set design 
goals with that model under consideration. We introduce 
three methods for reliable flash storage, which we refer to 
as in-place writes, multiple-place writes, and RS-Berger 
codes. Each method aims to meet our design goals for 
reliable non-volatile storage. 


3.1 Modeling Low-Voltage Flash Memory 


A NOR flash memory has a set of n cells that are initially 
set to 1. We represent the state of the cells by c_1, ..., c_n; 
the value of c_i can be 0 or 1. A cell can be set to 0 using 
a write operation. The 1 → 0 transition might fail at low 
voltage while the 1 → 1 will obviously succeed. Flash 
memory at low voltage, where errors occur only in one 
direction, can be modeled as a Z-channel [19]. 
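
A Z-channel write can be sketched in a few lines. The model below is an illustration under the assumption that each requested 1 → 0 transition fails independently with probability `p_fail`; the function name and parameters are hypothetical.

```python
import random


def z_channel_write(current, data, p_fail, rng):
    """Program `data` onto a byte whose present contents are `current`.

    Only 1 -> 0 transitions are attempted, and each one fails with
    probability p_fail. Bits that are already 0 stay 0, and bits that
    should remain 1 are never disturbed.
    """
    result = current
    for bit in range(8):
        mask = 1 << bit
        if (current & mask) and not (data & mask):  # 1 -> 0 requested
            if rng.random() >= p_fail:
                result &= ~mask
    return result


rng = random.Random(0)
stored = z_channel_write(0xFF, 0x0F, p_fail=0.3, rng=rng)
# Every 1 bit of the intended data is still 1 in the stored byte.
print(f"{stored:08b}", stored & 0x0F == 0x0F)
```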

Flash memory is a write-once memory [31] and once 
a cell is set to 0 (i.e., once it is programmed), it cannot be 
changed back to 1 without using an erase operation. In 
flash memory, cells are organized by blocks, and an erase 
operation resets an entire block of cells. Block erasures 
are costly in terms of time and energy and they cause 
wear to flash cells. 

Operations: There are two operations in this model: 
(1) An update operation that changes a subset of cells 
to 0 to represent a value, and (2) A decoding operation 
that maps cell states (i.e., memory state) to a value. Up- 
dating a variable means changing the values of c_1, ..., c_n 
to c'_1, ..., c'_n. Assuming no erase operation occurs, and 
therefore no bits are reset to 1 after being set to 0, we 
have ∀i ∈ {1, ..., n}, c_i ≥ c'_i after an update. If the update 
operation is performed when operating voltage is below 
the nominal minimum required for flash memory, the up- 
date operation may not be error free. 
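
The monotonicity constraint on updates can be expressed as a one-line bitwise test. In this sketch the cell states are packed into an integer, one bit per cell; the function name is illustrative.

```python
def is_valid_update(old_state, new_state):
    """True if no cell goes from 0 back to 1 (c_i >= c'_i for every i).

    A bit that is 1 in new_state but 0 in old_state would require an
    erase, which the update operation does not perform.
    """
    return (new_state & ~old_state) == 0


print(is_valid_update(0b1111, 0b1010))  # True: two cells programmed to 0
print(is_valid_update(0b1010, 0b1110))  # False: a cell would go from 0 to 1
```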


3.2 Design Goals 


Our storage techniques, which aim to provide reliable 
storage for low-power devices, are designed with the fol- 
lowing metrics in mind: 


• Error rate: The first and foremost design goal is to 
minimize the error rate to provide applications with 
reliable non-volatile storage. 


• Energy consumption: The energy consumed to 
achieve an acceptably low error rate should not ex- 
ceed the expected energy savings gained by running 
at a lower voltage. 


• Delay: We define delay as the difference between 
the execution time to reliably store data at a low 
voltage and to store the same data at a high voltage. 
The delay caused by the storage technique should 
be reasonably small. 


3.3 Proposed Methods 


Toward the design goals discussed previously, we pro- 
pose methods to deal with errors caused by using flash 
memory at low voltage. 
3.3.1 In-Place Writes 


Since the transition of a 1 to a 0 in a NOR flash memory 
at low voltage is stochastic rather than guaranteed, the 
in-place writes method repeats the write of each byte (to 
the same memory location) more than once if necessary, 
up to a threshold number of attempts. Algorithm 1 gives 
the details for ENCODE and DECODE procedures for in- 
place writes. 


Algorithm 1 The encoding and decoding algorithms for the 
in-place writes method, which stores data to address by 
repeating the write up to a threshold number of attempts 
if necessary. 


ENCODE(data, address, threshold) 
    WRITE_TO_FLASH(data, address) 
    result ← READ_FROM_FLASH(address) 
    repeat ← 1 
    while (result ≠ data) and (repeat < threshold) 
        do WRITE_TO_FLASH(data, address) 
           result ← READ_FROM_FLASH(address) 
           repeat ← repeat + 1 

DECODE(address) 
    result ← READ_FROM_FLASH(address) 
    return result 





The reason in-place writes decrease the error rate is 
that, as explained in Section 2.4, each write attempt in 
the same memory location increases the accumulated 
charge and therefore raises the probability of storing the 
intended bit sequence successfully. 
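
Algorithm 1 can be exercised in software against a toy flash model. The sketch below is illustrative only: `SimulatedFlash` is a hypothetical stand-in for the real flash controller, and it assumes each requested 1 → 0 transition succeeds independently with probability `p_success` per attempt. The returned attempt count is added for illustration and is not part of Algorithm 1.

```python
import random


class SimulatedFlash:
    """Toy NOR flash: bytes start erased (0xFF); programming is stochastic."""

    def __init__(self, size, p_success, seed=0):
        self.cells = [0xFF] * size
        self.p_success = p_success
        self.rng = random.Random(seed)

    def write(self, data, address):
        for bit in range(8):
            mask = 1 << bit
            # Only bits that must go from 1 to 0 are attempted.
            if not (data & mask) and (self.cells[address] & mask):
                if self.rng.random() < self.p_success:
                    self.cells[address] &= ~mask

    def read(self, address):
        return self.cells[address]


def encode(flash, data, address, threshold):
    """In-place writes: rewrite the same address until it reads back right."""
    flash.write(data, address)
    result = flash.read(address)
    repeat = 1
    while result != data and repeat < threshold:
        flash.write(data, address)
        result = flash.read(address)
        repeat += 1
    return repeat  # number of write attempts used


def decode(flash, address):
    return flash.read(address)


flash = SimulatedFlash(size=16, p_success=0.5, seed=1)
attempts = encode(flash, 0x3C, address=0, threshold=50)
print(attempts, f"{decode(flash, 0):08b}")
```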


3.3.2 Multiple-Place Writes 


Another approach to increase the reliability of flash 
writes at low voltage is to write a value to more than one 
location in flash memory, if necessary, up to a threshold 
number of locations. Later, to retrieve the stored data, 
the multiple-place writes method reads the data from the 
specified address and several other addresses associated 
with it, then returns the bitwise AND of all of the stored 
values. Algorithm 2 details ENCODE and DECODE pro- 
cedures of the multiple-place writes method. Writing a 
value to more than one memory location increases the 
probability of storing it successfully in the flash mem- 
ory. 

The reason the multiple-place writes approach can de- 
crease the error rate is as follows: All cells of flash mem- 
ory are initially set to 1. An error means that writing a 0 
has failed and a bit cell c_i has remained untouched (log- 
ical 1) although it was intended to be set to 0. If the 
write of cell c_i succeeds in at least one of the locations, 
so that c_i is 0 there, the AND of the values read from all 
locations yields c_i = 0 in the result. Writing a 1 to a cell 
cannot cause an error because the cell simply stays at 1. 


Algorithm 2 The encoding and decoding algorithms for the 
multiple-place writes method, which stores data to addr by 
repeating the write in up to a threshold number of loca- 
tions if necessary. The distance between consecutive 
associated locations is offset. 


ENCODE(data, addr, threshold, offset) 
    WRITE_TO_FLASH(data, addr) 
    result ← READ_FROM_FLASH(addr) 
    repeat ← 1 
    while (result ≠ data) and (repeat < threshold) 
        do phy_addr ← addr + (repeat × offset) 
           WRITE_TO_FLASH(data, phy_addr) 
           n_result ← READ_FROM_FLASH(phy_addr) 
           result ← result & n_result 
           repeat ← repeat + 1 

DECODE(addr, threshold, offset) 
    result ← all 1s 
    for i ← 0 to (threshold - 1) 
        do phy ← addr + (i × offset) 
           n_result ← READ_FROM_FLASH(phy) 
           result ← result & n_result 
    return result 
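
The following sketch runs the multiple-place writes procedure against a plain Python list standing in for flash. The stochastic `flash_write` helper and its `p_success` parameter are assumptions made for illustration; the decoder starts from all 1s so that locations left erased do not affect the AND.

```python
import random


def flash_write(cells, data, address, p_success, rng):
    """Toy low-voltage write: each 1 -> 0 transition succeeds with p_success."""
    for bit in range(8):
        mask = 1 << bit
        if (not (data & mask) and (cells[address] & mask)
                and rng.random() < p_success):
            cells[address] &= ~mask


def encode(cells, data, addr, threshold, offset, p_success, rng):
    """Write data to addr, then to further locations until the AND is right."""
    flash_write(cells, data, addr, p_success, rng)
    result = cells[addr]
    repeat = 1
    while result != data and repeat < threshold:
        phy_addr = addr + repeat * offset
        flash_write(cells, data, phy_addr, p_success, rng)
        result &= cells[phy_addr]
        repeat += 1
    return result


def decode(cells, addr, threshold, offset):
    """Bitwise AND of all associated locations (erased ones read 0xFF)."""
    result = 0xFF
    for i in range(threshold):
        result &= cells[addr + i * offset]
    return result


# Two partially failed copies of 0b00111100, with different bits stuck at 1.
demo_cells = [0xFF] * 256
demo_cells[0] = 0b00111101   # bit 0 failed to program
demo_cells[64] = 0b10111100  # bit 7 failed to program
print(f"{decode(demo_cells, 0, 3, 64):08b}")  # 00111100
```

Because each copy fails in a different bit position, the AND recovers the intended byte even though neither copy is correct on its own.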


3.3.3 RS-Berger Codes 


Our third method to provide reliable flash memory at low 
voltage involves data coding. We use the concatenation 
of Reed-Solomon [29] and Berger [5] codes, which we 
call RS-Berger codes, to detect and correct errors at read 
time. Reed-Solomon is a widely used error-correcting 
code that can correct twice as many erasures as errors. 
Therefore, if the locations of errors are known, an RS 
code’s error-correcting capacity is improved twofold. 
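
The twofold improvement follows from the standard decoding bound for Reed-Solomon codes. As a brief worked statement, let n_RS be the codeword length and k_RS the number of data symbols (symbol names chosen here for illustration, to avoid clashing with the k used for Berger codes). A codeword with e symbol errors at unknown positions and s erasures at known positions is recoverable when:

```latex
2e + s \le n_{\mathrm{RS}} - k_{\mathrm{RS}}
\quad\Longrightarrow\quad
e_{\max} = \left\lfloor \frac{n_{\mathrm{RS}} - k_{\mathrm{RS}}}{2} \right\rfloor \ (s = 0),
\qquad
s_{\max} = n_{\mathrm{RS}} - k_{\mathrm{RS}} \ (e = 0).
```

Each unknown error consumes two units of redundancy while each erasure consumes one, so knowing the error locations doubles the number of corrupted symbols that can be repaired.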

To detect the location of errors and therefore improve 
the efficiency of the RS code, we use a Berger code, an 
error-detecting code for asymmetric channels. As previ- 
ously mentioned (Section 3.1), flash memory at low volt- 
age can be modeled as a Z-channel for which a Berger 
code is suitable. A Berger codeword consists of two 
parts: k information bits and ⌈log_2(k+1)⌉ check bits. 
The check bits of the Berger codeword represent the 
number of zeros in the k information bits. The Berger 
code can detect all zero-to-one errors, because after any 
such error the number of zeros in the information-bit 
component is less than the number represented by the 
check-bit component. 
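
A Berger code for a short bit sequence can be sketched as follows. This is a minimal illustration with hypothetical function names; bits are held in Python lists and the check value is stored most significant bit first.

```python
def berger_encode(info_bits):
    """Append check bits holding the number of zeros in info_bits."""
    k = len(info_bits)
    width = k.bit_length()  # equals ceil(log2(k + 1)) for k >= 1
    zeros = info_bits.count(0)
    check_bits = [(zeros >> i) & 1 for i in reversed(range(width))]
    return info_bits + check_bits


def berger_valid(codeword, k):
    """True if the zero count of the first k bits matches the check value."""
    info, check = codeword[:k], codeword[k:]
    check_value = int("".join(map(str, check)), 2)
    return info.count(0) == check_value


codeword = berger_encode([1, 0, 1, 1, 0, 0, 1, 0])
print(codeword, berger_valid(codeword, 8))

# A zero-to-one error lowers the zero count (or raises the check value),
# so the two parts can no longer agree.
corrupted = list(codeword)
corrupted[1] = 1
print(berger_valid(corrupted, 8))  # False
```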
To represent RS-Berger codewords, we use a matrix 
in which each row is an RS codeword except for the last 
row which includes the Berger check bits of the RS code- 
words. In other words, each cell in the last row of the 
matrix is the sum of the number of zeros in the corre- 
sponding cells in all other rows. 


When encoding the data, we first use RS code to gen- 
erate n codewords (rows of the matrix) and then we apply 
a Berger code to compute the check bits for each symbol 
for all codewords (each column of the matrix). 


When decoding data, we first use the Berger decoder 
to check whether or not each column is erroneous. If 
one entry in the column is erroneous, we consider all the 
symbols in the column erasures; otherwise, all the sym- 
bols in the column are considered correct. Then, once 
the error locations are known, we apply RS decoding to 
correct the erroneous sequences row by row. 
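
The column-checking step of decoding can be sketched without a Reed-Solomon implementation. In the illustration below the rows are arbitrary byte sequences standing in for RS codewords (they are not real RS output), symbols are assumed to be 8 bits wide, and the function names are hypothetical. The output is the list of columns that an RS decoder would then treat as erasures.

```python
def zeros_in_symbol(symbol, bits=8):
    """Number of 0 bits in one symbol of the given width."""
    return bits - bin(symbol & ((1 << bits) - 1)).count("1")


def column_checks(codewords):
    """One check value per column: total zero bits in that column."""
    n = len(codewords[0])
    return [sum(zeros_in_symbol(cw[col]) for cw in codewords)
            for col in range(n)]


def erased_columns(read_codewords, read_checks):
    """Columns whose zero count no longer matches the stored check value."""
    return [col for col, chk in enumerate(column_checks(read_codewords))
            if chk != read_checks[col]]


rows = [[0x0F, 0xA5, 0x00],
        [0xF0, 0x5A, 0xFF]]
checks = column_checks(rows)

# Simulate one failed 1 -> 0 transition: 0xA5 is read back as 0xE5.
read_back = [list(r) for r in rows]
read_back[0][1] = 0xE5
print(erased_columns(read_back, checks))  # [1]
```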


Algorithm 3 The encoding and decoding algorithms for the 
RS-Berger codes write method. t is the maximum num- 
ber of errors the RS-Berger code can correct. 


ENCODE(data_{1...N}, n) 
    for i ← 1 to N 
        do CW_i ← RS_ENCODE(data_i, n) 
           WRITE_TO_FLASH(CW_i, address_i) 
    for i ← 1 to n 
        do for j ← 1 to N 
               do sym_{i,j} ← CW_j(i) 
           chk_i ← BERGER_ENCODE(sym_{i,(1...N)}) 
           WRITE_TO_FLASH(chk_i, address_{N+1} + i - 1) 

DECODE(addr_{1...(N+1)}, n, t) 
    for i ← 1 to n 
        do chk_i ← READ_FROM_FLASH(addr_{N+1} + i - 1) 
    for i ← 1 to N 
        do CW_i ← READ_FROM_FLASH(addr_i) 
           for j ← 1 to n 
               do sym_{j,i} ← CW_i(j) 
    errors ← {} 
    for i ← 1 to n 
        do err ← BERGER_DECODE(sym_{i,(1...N)}, chk_i) 
           if err ≠ 0 
               then errors ← errors ∪ {i} 
    if |errors| ≤ t 
        then for i ← 1 to N 
                 do result_i ← RS_DECODE(CW_i, errors) 
             return result 
        else return "fail to correct errors" 
4 Evaluation


Our storage techniques are designed for the resource lim- 
itations of low-power devices. In this section, we first 
evaluate the suitability of the three methods proposed in 
Section 3.3 for low-power devices; we then evaluate the 
hypothesis that for CPU-bound workloads, operating at 
low voltage and managing errors is more energy efficient 
than fixing the operating voltage to the maximum of all 
the components’ nominal minimum voltages. 

Summary of results: For a sensor monitoring application
that reads 256 data samples from flash memory,
aggregates data, and stores the results in flash memory,
use of in-place writes at 1.8 V reduces energy
consumption by up to 34% versus running the same
application at 2.2 V (the minimum voltage requirement
for the flash memory). This sensing application models a
common workload for both wireless sensor nodes and
RFID-scale devices.

Experimental setup: We used a consistent experimental
setup to measure the energy consumption and execution
time of each program. Using an oscilloscope, we
measured the voltage across a small resistor in series
with an MSP430 microcontroller programmed with a task
(e.g., a flash write). The integral of the current
(voltage divided by the resistance) over the execution
time of the task, multiplied by the operating voltage of
the device, gives the energy consumption of that task
(Energy = V × ∫ i(t) dt). To facilitate precise
identification of the task on the oscilloscope, the
microcontroller toggled a GPIO pin immediately before
and after the task.
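
The energy computation from an oscilloscope trace can be approximated numerically. The sketch below is an assumption-laden illustration: the function name, the uniform sampling interval, the trapezoidal rule, and the example numbers are all hypothetical and are not taken from the measurements reported here.

```python
def task_energy(v_shunt, dt, r_shunt, v_supply):
    """Trapezoidal estimate of E = V * integral(i(t) dt), in joules.

    v_shunt  : voltage samples across the series resistor (volts)
    dt       : sampling interval between samples (seconds)
    r_shunt  : resistance of the series resistor (ohms)
    v_supply : operating voltage of the device (volts)
    """
    currents = [v / r_shunt for v in v_shunt]          # i(t) = v(t) / R
    charge = sum((a + b) / 2 * dt                      # integral of i(t) dt
                 for a, b in zip(currents, currents[1:]))
    return v_supply * charge

# Hypothetical trace: a constant 10 mV across a 10-ohm resistor (1 mA)
# for 1 ms (101 samples, 10 us apart) with the device running at 1.8 V.
samples = [0.010] * 101
print(task_energy(samples, 1e-5, 10.0, 1.8))           # about 1.8e-06 J
```

Only the samples between the two GPIO toggles would be passed in, so that the integral covers exactly the task under test.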


4.1 Comparison of the Proposed Storage Methods 


The workload used to measure the performance of each 
of the proposed methods is the storage of accelerome- 
ter traces—generated using the Intel WISP 4.1’s 10-bit 
ADC sensor—to flash memory. The input trace is a se- 
ries of three-dimensional 16-bit samples containing ten 
bits of information. We used a simple data compression 
method to store more data in the available flash memory. 
The compression method involved reading four samples 
of data, preparing the first byte of each sample to be 
stored in flash memory, then combining the remaining 
two bits of each sample into one byte of data. Using 
this compression scheme, we reduced every four samples 
(eight bytes) to five bytes. 
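
The packing scheme can be sketched as below. The bit layout is an assumption: the text does not say which eight bits form "the first byte" or how the leftover bits are ordered, so this sketch takes the low eight bits as the byte and places the two remaining bits of sample i at bit positions 2i and 2i+1 of the fifth byte. The names `pack4` and `unpack4` are hypothetical.

```python
def pack4(samples):
    """Pack four 10-bit samples (eight bytes as 16-bit words) into five bytes."""
    assert len(samples) == 4 and all(0 <= s < 1024 for s in samples)
    low = [s & 0xFF for s in samples]          # one full byte per sample
    high = 0
    for i, s in enumerate(samples):
        high |= (s >> 8) << (2 * i)            # two leftover bits per sample
    return bytes(low + [high])

def unpack4(packed):
    """Inverse of pack4: recover the four 10-bit samples."""
    high = packed[4]
    return [packed[i] | (((high >> (2 * i)) & 0x3) << 8) for i in range(4)]

trace = [0, 1023, 512, 300]
packed = pack4(trace)
print(len(packed), unpack4(packed) == trace)
# prints: 5 True
```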

The maximum number of write attempts for both the
in-place writes and multiple-place writes methods was
set to two. The RS-Berger codes used three codewords of
size 38 bytes (32 bytes data and 6 bytes parity). These
settings enable all three methods to fit their data in
192 bytes of flash memory. Table 3 shows the energy
consumption and time taken for the same workload under
each method. Both in-place writes and multiple-place
writes consume less energy and finish more quickly at
1.9 V than at 1.8 V. Both of these methods are feedback
based and repeat writes if they detect errors. Because
there is a lower chance of error at 1.9 V, fewer
rewrites are required than at 1.8 V, so less energy and
time are required.

The in-place writes method slightly outperforms the
multiple-place writes method at both voltage levels
because its decoding procedure is less CPU intensive.
The in-place writes method also has the best Error
Correction Rate (ECR in Table 3) of all. The
multiple-place writes method seems to be the most
suitable when some memory cells are hard to program, so
that rewriting those cells is not helpful (Figure 5
gives an example of such a case). Compared to RS-Berger
codes that always guarantee that a certain number of er- 
rors can be corrected, the in-place writes and multiple- 
place writes methods are less reliable—they offer no 
such guarantees. Therefore, for applications with a hard 
reliability requirement, RS-Berger codes may be more 
suitable if the application knows the error rate in advance 
and is willing to incur extra computational costs for RS- 
Berger encoding and decoding. 


























Method   | V   | Time (ms) | E (µJ) | ECR
In-place | 1.8 | 24.16     | 59     | 96%
M-place  | 1.8 | 25.00     | 63     | 84%
RS-B     | 1.8 | 334.45    | 160    | 0%
In-place | 1.9 | 15.43     | 38     | 100%
M-place  | 1.9 | 16.85     | 40     | 100%
RS-B     | 1.9 | 334.73    | 180    | 100%























Table 3: Performance comparison of the proposed meth- 
ods at 1.8 V and 1.9 V. Error Correction Rate (ECR) 
shows the effectiveness of the methods. 


Error Correction Rate: As Table 3 illustrates, the 
two methods that do not use coding—in-place writes and 
multiple-place writes—incur similar energy consump- 
tion costs. We now compare the effectiveness of these 
two approaches with respect to the error correction rate. 

Figure 7 and Figure 8 demonstrate that flash storage 
reliability improves as we increase the number of re- 
peated writes/places at five different voltage levels (all 
below the nominal minimum voltage for flash writes). 

Experiment: Using our automated testbed, the test
platform runs a program that writes zeros to 192
consecutive bytes of flash memory (using the in-place
writes and multiple-place writes methods in two
different experiments). We increase the maximum number
of repeated writes from one to ten, one unit at a time.
The monitoring platform counts the number of incorrectly
stored bytes (those that are not set to zero after the
experiment). The experiment was repeated for five
different voltages (1.86 V to 1.90 V).








[Plot axis: Percentage of incorrect bytes (%)]

Figure 7: Reliability improvement using in-place writes
over five different voltages.

Figure 8: Reliability improvement using multiple-place
writes over five different voltages.


Figure 9 compares the error rate of the in-place and 
multiple-place write methods. We choose the same max- 
imum number of repeated writes for both approaches. As 
the graph shows, the in-place writes method improves 
the error rate more dramatically. We attribute this phe- 
nomenon to the fact that electrons accumulate in flash 
cells with each programming attempt. Figure 9 also al- 
lows us to evaluate hybrids of the in-place writes and 
multiple-place writes methods. For example, choosing 
one place to write the value and repeating the write up 
to three times (up to three writes in total) works better 
than repeating the write up to twice in two places (up to 
four writes in total). This graph offers evidence that a 
pure in-place writes approach works better than a hybrid 
approach or a pure multiple-place writes approach.
However, we do not conclude that the in-place writes
method always outperforms the multiple-place writes
method. A winning case for multiple-place writes is a
flash memory with unbalanced blocks (different error
rates), for example, the chip shown in Figure 5. While
the multiple-place writes method requires more space, it
could provide more reliable storage than in-place
writes.


FAST 11: 9th USENIX Conference on File and Storage Technologies 




[Figure 9 plot. Axes: number of in-place writes, number of places, and error rate (%) at 1.86 V.]

Figure 9: The in-place writes method reduces the error 
rate more effectively than multiple-place writes and a hy- 
brid of both methods. 


4.2 Half-wits Versus Wits in Practice 


To evaluate our storage schemes, we consider three test 
cases representing CPU operations, flash read opera- 
tions, and flash write operations. 

The RC5 [30] test case is a CPU-only workload. RC5 is a
commonly used encryption algorithm that can cope with
the resource limitations of low-power devices [8, 18].
We implemented RC5 with a 32-bit word size, 18 rounds,
and a 16-byte secret key.

The retrieve and store test cases are both I/O-bound
tasks. One reads 192 bytes of data from flash memory and
the other writes 192 bytes to it. CPU-bound operations
in these test cases are minimal (essentially only a loop
that calls a flash-memory access function). The store
program uses in-place writes with a maximum of three
(re)writes to deal with errors. Because flash read
operations are fundamentally simpler than flash write
operations, flash reads are reliable at low voltage.

We run each of the three test cases on an MSP430F2131
microcontroller at four different voltages that are all in
the operating range of this microcontroller (1.8 V to 3.5 V).
Two voltage levels are below the recommended thresh- 
old for flash memory: 1.8 V and 1.9 V. Two voltage lev- 
els are at and above the recommended threshold: 2.2 V 
and 3.0 V. The microcontroller is set to work at its high- 
est possible clock rate for each voltage level in order 
to gain the best energy performance. Figure 10 com- 
pares the average energy consumption over five trials of 
each test case at each voltage. By running at 1.8 V (be- 
low the nominal minimum voltage for flash writes on 
the MSP430F2131), the microcontroller consumes 48% 
and 33% less energy to finish the RC5 and retrieve test 
cases respectively. However, our storage schemes do not 
seem to be beneficial for flash-write-intensive tasks (the 
store test case). 

To evaluate the end-to-end performance of our storage
methods, we have tested a sensor-monitoring application
that is CPU-intensive and can benefit from low-voltage
storage. This application reads 256 accelerometer
samples (each ten bits) from flash memory, computes the
maximum, minimum, mean, and standard deviation of the
samples, and stores the aggregate information in flash
memory. This monitoring application is a blend of CPU
and I/O, but it is still a CPU-intensive workload.
Table 4 shows that providing the system with a
low-voltage storage mechanism via our methods helps to
decrease the task’s total energy consumption by 34%.


Figure 10: Micro-benchmarks: CPU (RC5), read (retrieve),
and write (store) energy consumption measured at four
different voltage levels. Although the RC5 and retrieve
test cases consume less energy at low voltage, this is
not the case for the store test case (a write-intensive
application), as the savings due to running the chip at
low voltage do not compensate for the energy cost
required to correct errors.


4.3 Finding a Crossover Point 


We can empirically find the point at which the energy
saved on computation compensates for the added cost
of repeated flash writes. We compare a workload executed
at 2.2 V to the same one running at 1.8 V using the
in-place writes scheme with the threshold k set to 2.
We make the worst-case assumption that all data must be
written to flash twice (no bits change on the first
attempt). The time spent on flash writes while running
at 1.8 V is then twice the time spent when operating at
2.2 V. We also assume that the clock rate of the system
is set to the highest specified for the CPU at each
voltage. Specifically, the clock rate would be set to
6 MHz at 1.8 V and to 8 MHz at 2.2 V.


Method      | In-place | In-place | None   | None
Voltage     | 1.8 V    | 1.9 V    | 2.2 V  | 3.0 V
Clock rate  | 6 MHz    | 6 MHz    | 8 MHz  | 14 MHz
Energy (µJ) | 270      | 300      | 410    | 760
Time (ms)   | 151.15   | 151.32   | 113.24 | 64.72

Table 4: Energy consumption and execution time for the
accelerometer sensor application. At voltages below the
recommended minimum (1.8 V and 1.9 V), the in-place
writes method with a threshold of two is used.

We empirically determined the power consumption of the
CPU and of flash writes with 1.8 V and 2.2 V voltage
supplies: $P_{C\_1.8} = 1.8$ mW, $P_{C\_2.2} = 3.4$ mW,
$P_{F\_1.8} = 3.7$ mW, and $P_{F\_2.2} = 5.8$ mW. The
variables $T_C$ and $T_F$ are the time spent in
computation and on flash memory respectively. With these
assumptions, we can write the following inequality to
determine whether a given workload is likely to result
in reduced energy consumption:

\[
\begin{aligned}
& Energy_{1.8} < Energy_{2.2} \\
\Rightarrow\;& P_{C\_1.8} \times T_{C\_1.8} + P_{F\_1.8} \times k \times T_{F\_1.8} < P_{C\_2.2} \times T_{C\_2.2} + P_{F\_2.2} \times T_{F\_2.2} \\
\Rightarrow\;& P_{C\_1.8} \times \frac{8\,\mathrm{MHz}}{6\,\mathrm{MHz}} \times T_{C\_2.2} + P_{F\_1.8} \times k \times \frac{8\,\mathrm{MHz}}{6\,\mathrm{MHz}} \times T_{F\_2.2} < P_{C\_2.2} \times T_{C\_2.2} + P_{F\_2.2} \times T_{F\_2.2}
\end{aligned}
\]

The solution with k = 2 is $T_{C\_2.2} > 4 \times T_{F\_2.2}$.
Therefore, in-place writes are competitive over normal
flash writes when the time spent on computation is at
least four times greater than the time spent on flash
writes.
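
As a quick arithmetic check of this crossover, the inequality can be solved for the ratio of computation time to flash-write time using the four measured power values. The sketch below is only a worked calculation; the function name `crossover_ratio` is hypothetical.

```python
P_C_18, P_C_22 = 1.8, 3.4     # CPU power in mW at 1.8 V and 2.2 V
P_F_18, P_F_22 = 3.7, 5.8     # flash-write power in mW at 1.8 V and 2.2 V
CLOCK_RATIO = 8 / 6           # slowdown from 8 MHz (2.2 V) to 6 MHz (1.8 V)

def crossover_ratio(k):
    """Smallest T_C/T_F (both measured at 2.2 V) for which 1.8 V uses less energy."""
    extra_flash_cost = P_F_18 * k * CLOCK_RATIO - P_F_22   # per unit of T_F
    cpu_saving = P_C_22 - P_C_18 * CLOCK_RATIO             # per unit of T_C
    return extra_flash_cost / cpu_saving

print(round(crossover_ratio(2), 2))   # about 4.07, i.e. roughly four
```

With k = 2 the extra flash cost is about 4.07 mW per unit of flash time and the CPU saving is about 1.0 mW per unit of computation time, which gives the factor of four quoted above.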


5 Improvements and Alternatives 


This section describes several complementary ways to 
further decrease the energy requirements of our schemes. 

Hardware. One could add an adjustable voltage 
regulator and about a dozen other analog components 
such that software could toggle a GPIO for discrete dy- 
namic voltage scaling. A feedback loop that dynamically 
adjusts a voltage supply could help identify the mini- 
mum voltage at which no write errors are detected, but 
such boundaries can vary with temperature and wear-out. 
Thus, our coding algorithms would remain helpful to 
cope with potential errors. Our work seeks to avoid hard- 
ware modification that would require additional compo- 
nents or design changes to a Printed Circuit Board (PCB) 
because embedded applications are often cost sensitive. 
Changing the PCB layout may require a manufacturer 
to flush its supply chain of parts typically manufactured 
in high volume. If an inexpensive, software-only ap- 
proach with minimal disturbance to manufacturing can 
lead to significant savings in energy consumption, then 
it is hard to financially justify an expensive hardware ap- 
proach that offers only comparable performance. 

Figure 11: ECG data stored in flash memory at 1.89 V
(the same chip from Figure 2) improved by using a sign
bit. The light-colored bars show the difference between
the ECG stored at low voltage and the original ECG data.


Sign bits and storing complements. As discussed in
Section 2.3, one of the major factors influencing the
error rate is the Hamming weight of a number. One way to
improve the performance of the low-voltage storage
methods is to store numbers with greater Hamming weights
(weight > 4) in flash memory. If a number is lightweight
(weight < 4), the complement of the number would be
stored and a sign bit would be set for future data
access. An array of sign bits can be stored separately
from the data to avoid disturbing word alignment. A
previous work [26] uses a similar technique for
multi-level cell (MLC) flash memories with four levels;
their techniques result in a significant decrease of
energy consumption. Figure 11 shows that using the
sign-bit scheme decreases the error rate at low voltage
for the same ECG data used in Section 2. For this
specific example, out of 168 bytes of ECG data, 160
bytes are overweight and therefore using the sign-bit
scheme greatly decreased the error rate. The sign-bit
approach involves very lightweight computation (counting
the number of ones) and increases the number of writes
by a factor of one-eighth. Therefore, the effect of this
improvement on energy consumption and delay should be
comparatively small.
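
The sign-bit idea can be sketched in a few lines. This is an illustration with hypothetical names, not the authors' code. The text leaves the case of weight exactly 4 open, so the sketch assumes such bytes are stored unchanged; it also keeps the sign bits in a separate list, as suggested above.

```python
def hamming_weight(byte):
    """Number of one bits in an 8-bit value."""
    return bin(byte).count("1")

def encode(data, threshold=4):
    """Return (stored_bytes, sign_bits); light bytes are stored complemented."""
    stored, signs = [], []
    for b in data:
        flip = hamming_weight(b) < threshold   # assumption: weight 4 is kept as-is
        stored.append(b ^ 0xFF if flip else b)
        signs.append(1 if flip else 0)
    return bytes(stored), signs

def decode(stored, signs):
    """Undo the complement wherever the sign bit is set."""
    return bytes(b ^ 0xFF if s else b for b, s in zip(stored, signs))

raw = bytes([0x01, 0xFE, 0x0F, 0x10])
stored, signs = encode(raw)
print(stored.hex(), signs, decode(stored, signs) == raw)
# prints: fefe0fef [1, 0, 0, 1] True
```

Every stored byte then has a Hamming weight of at least four, at the cost of one extra bit per byte.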


Memory mapping table. Another method to exploit 
the fact that numbers with greater Hamming weights 
have a lower probability of error is to map the most fre- 
quently used numbers in the user’s data to the heavier 
numbers. The solution we suggest is to preprocess the 
data to sort numbers based on their frequency of use. 
A simple memory mapping table would map the most 
frequent numbers to the heaviest numbers. Such a table 
could be preloaded in flash memory so that storing the 
table would not consume energy at run time. Use of a 
memory mapping table would only increase the number 
of reads and would not increase the number of writes. 
Therefore, the energy consumption overhead and the
delay should be smaller than for the sign-bit method.
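
A memory mapping table of this kind could be built as in the following sketch. The function name, the byte-wide values, and the tie-breaking order among equally heavy values are assumptions made for illustration only.

```python
from collections import Counter

def build_table(data):
    """Map the most frequent byte values in `data` to the heaviest byte values."""
    by_frequency = [value for value, _ in Counter(data).most_common()]
    by_weight = sorted(range(256),
                       key=lambda v: bin(v).count("1"),   # Hamming weight
                       reverse=True)
    return dict(zip(by_frequency, by_weight))

sample = bytes([0, 0, 0, 7, 7, 255])          # hypothetical user data
table = build_table(sample)                   # precomputed, e.g. offline
inverse = {heavy: value for value, heavy in table.items()}

stored = bytes(table[b] for b in sample)      # what would be written to flash
restored = bytes(inverse[b] for b in stored)  # lookup on the read path
print(table[0], restored == sample)
# prints: 255 True
```

The most frequent value (0 in this example) is stored as 0xFF, the heaviest byte, and reading back needs only a table lookup, which matches the observation that the scheme adds reads but no writes.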


An ideal, unrealizable scheme. We initially tried to
set the voltage to a level lower than recommended but
high enough to avoid errors. This method could not be
realized for two reasons. First, finding a voltage that
satisfies this condition requires a large number of
experiments per chip, because the error rate varies from
chip to chip (Figure 3). Second, the error rate of flash
writes depends on the chip's lifespan and its
environment. We found that the byte error rate of an
MSP430F2131, which is 63% at 1.83 V and 25°C, becomes
negligible when the temperature rises to 39°C.


6 Related Work 


Storage for low-power embedded devices: Recent 
research focuses on optimizing use of off-chip flash 
memory. Off-chip memory allows for special features 
and larger memories than found on microcontrollers, 
but introduces additional costs for components. Micro- 
hash [38] is a memory index structure tailored for sen- 
sor devices with a large external flash memory. Mathur 
et al. [23] perform an extensive study of available flash 
memory candidates for sensor devices and demonstrate 
that an off-chip parallel NAND flash memory decreases 
the energy consumption of storage. Considering off-chip
NAND flash memory as the best candidate for sensor
devices, Agrawal et al. [1] propose a method that allows
sensor devices to exploit their flash memory while
adapting to different amounts of RAM. However, our
storage schemes are designed for already deployed low- 
power devices that use on-chip flash memory. Moreover, 
while devices at the scale of sensor nodes might switch to 
block-grained, large off-chip flash memory, RFID-scale 
platforms might not benefit from this transition because 
of their challenging resource limitations to drive I/O. 


Energy proportionality: Our approaches share the
philosophy that energy consumption should scale in
proportion to utilization or error rates rather than to
a worst-case scenario. Blaauw et al. [6] reduce power
consumption by lowering the operating voltage of a
pipelined CPU. Certain pipeline stages may produce
incorrect results that require recomputation, but the
errors can be made rare enough to allow better scaling
of power consumption. Misailovic et al. [25] demonstrate
that programs whose loops perform fewer iterations
incur tolerable errors while their execution time
becomes shorter. Weddle et al. [37] introduce PARAID, a
scheme that scales power based on user demand while
maintaining the reliability of the system. The present
work also tries to scale power based on the utilization
of flash memory without losing storage reliability. Our
approaches share this philosophy of scaling performance
with utilization: our performance metric is energy
consumption, writes to flash memory represent our
utilization, and energy-efficient error correction is
our coping mechanism.


Error correction codes for storage: Most previously
published flash error correction codes [9, 11, 14] are
designed for NAND flash memory. Chen et al. [10] mention
that NOR flash normally does not require error
correction. These techniques consider neither the
asymmetry in low-voltage flash memory nor the resource
limitations of low-power embedded devices. Many previous
codes [4, 16, 40, 35] leverage the fact that each cell of
MLC flash memory represents more than one bit of
information. However, because single-level cells (SLC)
are more suitable for embedded devices and errors occur
in low-voltage conditions, these codes need to be
reconsidered for SLCs at low voltage.
Zemor et al. [39] introduce error-correcting WOM codes 
for flash memory. They suggest codes that are able to 
correct up to one error when the flash memory is given 
enough voltage. This work does not account for errors 
that occur at low voltage. Godard et al. [12] propose hier- 
archical code correction and reliability management for 
NOR flash memory. This work considers on-chip ECCs 
such as Hamming and parity codes to correct the errors 
in NOR flash memory. 


7 Conclusions and Future Work 


The high voltage requirement of on-chip flash memory 
is a barrier to reducing the total energy consumption of 
low-power devices. This work examines the main fac- 
tors affecting the behavior of flash memory at low volt- 
age. Based on our observations of flash memory behav- 
ior at low voltage, we proposed three storage schemes— 
in-place writes, multiple-place writes, and RS-Berger 
codes—that aim to make flash memory available and re- 
liable at low voltage while tolerating the resource limi- 
tations of low-power devices. Our evaluation shows that 
in-place writes can save 34% of energy consumption for 
a sensing workload on the MSP430 microcontroller. 
Future work includes finding more energy-efficient 
coding schemes to combat flash write errors caused by 
low voltage. Currently, the system cannot take full ad- 
vantage of dynamic voltage scaling. Another plan is to 
introduce benchmarks for the storage systems of low- 
power devices. The standard benchmarks used to eval- 
uate the storage systems designed for desktop computers 
are not immediately applicable to the low-power domain. 
Abstract


The predicted shift to non-volatile, byte-addressable
memory (e.g., Phase Change Memory and Memristor),
the growth of “big data”, and the subsequent emergence
of frameworks such as memcached and NoSQL systems
require us to rethink the design of data stores. To
derive the maximum performance from these new memory
technologies, this paper proposes the use of
single-level data stores. For these systems, where no
distinction is made between a volatile and a persistent
copy of data, we present Consistent and Durable Data
Structures (CDDSs) that, on current hardware, allow
programmers to safely exploit the low-latency and
non-volatile aspects of new memory technologies. CDDSs
use versioning to allow atomic updates without requiring
logging. The same versioning scheme also enables
rollback for failure recovery. When compared to a
memory-backed Berkeley DB B-Tree, our prototype-based
results show that a CDDS B-Tree can increase put and get
throughput by 74% and 138%. When compared to Cassandra,
a two-level data store, Tembo, a distributed Key-Value
system built on a CDDS B-Tree, increases throughput by
up to 250% to 286%.


1 Introduction 


Recent architecture trends and our conversations with 
memory vendors show that DRAM density scaling is fac- 
ing significant challenges and will hit a scalability wall 
beyond 40nm [26, 33, 34]. Additionally, power con- 
straints will also limit the amount of DRAM installed in 
future systems [5, 19]. To support next generation sys- 
tems, including large memory-backed data stores such 
as memcached [18] and RAMCloud [38], technologies 
such as Phase Change Memory [40] and Memristor [48] 
hold promise as DRAM replacements. Described in
Section 2, these memories offer latencies that are
comparable to DRAM and are orders of magnitude faster
than either disk or flash. Not only are they
byte-addressable and low-latency like DRAM, but they are
also non-volatile.


Projected cost [19] and power-efficiency characteris- 
tics of Non-Volatile Byte-addressable Memory (NVBM) 
lead us to believe that it can replace both disk and mem- 
ory in data stores (e.g., memcached, database systems, 
NoSQL systems, etc.) but not through legacy inter- 
faces (e.g., block interfaces or file systems). First, the 
overhead of PCI accesses or system calls will dominate 
NVBM’s sub-microsecond access latencies. More im- 
portantly, these interfaces impose a two-level logical sep- 
aration of data, differentiating between in-memory and 
on-disk copies. Traditional data stores have to both up- 
date the in-memory data and, for durability, sync the data 
to disk with the help of a write-ahead log. Not only does 
this data movement use extra power [5] and reduce per- 
formance for low-latency NVBM, the logical separation 
also reduces the usable capacity of an NVBM system. 


Instead, we propose a single-level NVBM hierarchy 
where no distinction is made between a volatile and a 
persistent copy of data. In particular, we propose the use 
of Consistent and Durable Data Structures (CDDSs) to 
store data, a design that allows for the creation of log- 
less systems on non-volatile memory without processor 
modifications. Described in Section 3, these data struc- 
tures allow mutations to be safely performed directly 
(using loads and stores) on the single copy of the data 
and metadata. We have architected CDDSs to use ver- 
sioning. Independent of the update size, versioning al- 
lows the CDDS to atomically move from one consis- 
tent state to the next, without the extra writes required 
by logging or shadow paging. Failure recovery simply 
restores the data structure to the most recent consistent 
version. Further, while complex processor changes to 
support NVBM have been proposed [14], we show how 
primitives to provide durability and consistency can be 
created using existing processors. 
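
The versioning idea can be illustrated with a toy in-memory model. This is a hedged sketch, not the CDDS implementation: the class, its fields, and the representation of an entry as a [start, end) version range are assumptions made for illustration, and the cache-flush and fence primitives needed on real non-volatile memory are omitted.

```python
class VersionedStore:
    """Toy model: every entry is live for versions in [start, end)."""

    def __init__(self):
        self.entries = []      # [key, value, start, end]; end None = still live
        self.committed = 0     # most recent consistent version

    def put(self, key, value):
        nxt = self.committed + 1
        for e in self.entries:
            if e[0] == key and e[3] is None:
                e[3] = nxt                     # old value ends at the new version
        self.entries.append([key, value, nxt, None])
        return nxt

    def commit(self, version):
        self.committed = version               # stands in for one atomic write

    def recover(self):
        """Discard all work from versions newer than the committed one."""
        self.entries = [e for e in self.entries if e[2] <= self.committed]
        for e in self.entries:
            if e[3] is not None and e[3] > self.committed:
                e[3] = None                    # revive entries ended by lost updates

    def get(self, key):
        for e in self.entries:
            live = e[3] is None or e[3] > self.committed
            if e[0] == key and e[2] <= self.committed and live:
                return e[1]
        return None

s = VersionedStore()
s.commit(s.put("a", 1))    # version 1 becomes the consistent state
s.put("a", 2)              # update in progress, "crash" before commit
s.recover()
print(s.get("a"))          # prints: 1
```

No log is written in this model: an update only becomes visible when the single version counter advances, and recovery needs nothing beyond that counter and the version tags on the entries.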


We have implemented a CDDS B-Tree because of its 
non-trivial implementation complexity and widespread 
use in storage systems. Our evaluation, presented in
Section 4, shows that a CDDS B-Tree can increase put
and get throughput by 74% and 138% when compared
to a memory-backed Berkeley DB B-Tree. Tembo¹, our
Key-Value (KV) store described in Section 3.5, was
created by integrating this CDDS B-Tree into a
widely-used open-source KV system. Using the Yahoo Cloud
Serving Benchmark [15], we observed that Tembo increases
throughput by up to 250%-286% when compared to
memory-backed Cassandra, a two-level data store.


2 Background and Related Work 


2.1 Hardware Non-Volatile Memory 


Significant changes are expected in the memory indus- 
try. Non-volatile flash memories have seen widespread 
adoption in consumer electronics and are starting to gain 
adoption in the enterprise market [20]. Recently, new 
NVBM memory technologies (e.g., PCM, Memristor, 
and STTRAM) have been demonstrated that significantly 
improve latency and energy efficiency compared to flash. 

As an illustration, we discuss Phase Change Mem- 
ory (PCM) [40], a promising NVBM technology. PCM 
is a non-volatile memory built out of Chalcogenide- 
based materials (e.g., alloys of germanium, antimony, 
or tellurium). Unlike DRAM and flash that record data 
through charge storage, PCM uses distinct phase change 
material states (corresponding to resistances) to store val- 
ues. Specifically, when heated to a high temperature for 
an extended period of time, the materials crystallize and 
reduce their resistance. To reset the resistance, a current 
large enough to melt the phase change material is applied 
for a short period and then abruptly cut off to quench the
material into the amorphous phase. The two resistance 
states correspond to a ‘0’ and ‘1’, but, by varying the 
pulse width of the reset current, one can partially crystal- 
lize the phase change material and modify the resistance 
to an intermediate value between the ‘0’ and ‘1’ resis- 
tances. This range of resistances enables multiple bits 
per cell, and the projected availability of these MLC de- 
signs is 2012 [25]. 

Table 1 summarizes key attributes of potential stor- 
age alternatives in the next decade, with projected data 
from recent publications, technology trends, and direct 
industry communication. These trends suggest that fu- 
ture non-volatile memories such as PCM or Memris- 
tors can be viable DRAM replacements, achieving com- 
petitive speeds with much lower power consumption, 
and with non-volatility properties similar to disk but 
without the power overhead. Additionally, a number 
of recent studies have identified a slowing of DRAM
growth [25, 26, 30, 33, 34, 39, 55] due to scaling
challenges for charge-based memories. In conjunction
with DRAM’s power inefficiencies [5, 19], these trends
can potentially accelerate the adoption of NVBM
memories.

¹ Swahili for elephant, an animal anecdotally known for its memory.

NVBM technologies have traditionally been limited
by density and endurance, but recent trends suggest that
these limitations can be addressed. Increased density can
be achieved within a single die through multi-level
designs and, potentially, multiple layers per die. At a
single-chip level, 3D die stacking using through-silicon
vias (TSVs) for inter-die communication can further
increase density. PCM and Memristor also offer higher
endurance than flash (10^8 writes/cell compared to 10^5
writes/cell for flash). Optimizations at the technology,
circuit, and systems levels have been shown to further
address endurance issues, and more improvements are
likely as the technologies mature and gain widespread
adoption.

These trends, combined with the attributes summarized
in Table 1, suggest that technologies like PCM and
Memristors can be used to provide a single “unified
datastore” layer, an assumption underpinning the system
architecture in our paper. Specifically, we assume a
storage system layer that provides disk-like
functionality but with memory-like performance
characteristics and improved energy efficiency. This
layer is persistent and byte-addressable. Additionally,
to best take advantage of the low-latency features of
these emerging technologies, non-volatile memory is
assumed to be accessed off the memory bus. Like other
systems [12, 14], we also assume that the hardware can
perform atomic 8-byte writes.

While our assumed architecture is future-looking, it 
must be pointed out that many of these assumptions are 
being validated individually. For example, PCM sam- 
ples are already available (e.g., from Numonyx) and 
an HP/Hynix collaboration [22] has been announced 
to bring Memristor to market. In addition, aggressive 
capacity roadmaps with multi-level cells and stacking 
have been discussed by major memory vendors. Finally, 
previously announced products have also allowed non- 
volatile memory, albeit flash, to be accessed through the 
memory bus [46]. 


2.2 File Systems 


Traditional disk-based file systems are also faced with 
the problem of performing atomic updates to data struc- 
tures. File systems like WAFL [23] and ZFS [49] use 
shadowing to perform atomic updates. Failure recovery 
in these systems is implemented by restoring the file sys- 
tem to a consistent snapshot that is taken periodically. 
These snapshots are created by shadowing, where every 
change to a block creates a new copy of the block.
Recently, Rodeh [42] presented a B-Tree construction
that can provide efficient support for shadowing, and
this technique has been used in the design of
BTRFS [37]. Failure recovery in a CDDS uses a similar
notion of restoring the data structure to the most
recent consistent version. However, the versioning
scheme used in a CDDS results in fewer data copies
when compared to shadowing.


Technology      | Density (µm²/bit) | Read latency (ns) | Write latency (ns) | Read energy (pJ/bit) | Write energy (pJ/bit) | Endurance (writes/bit)
HDD             | 0.00006           | 3,000,000         | 3,000,000          | 2,500                | 2,500                 | ∞
Flash SSD (SLC) | 0.00210           | 25,000            | 200,000            | 250                  | 250                   | 10^5
DRAM (DIMM)     | 0.00380           | 55                | 55                 | 24                   | 24                    | 10^18
PCM             | 0.00580           | 48                | 150                | 2                    | 20                    | 10^8
Memristor       | 0.00580           | 100               | 100                | 2                    | 2                     | 10^8

Table 1: Non-Volatile Memory Characteristics: 2015 Projections


2.3 Non-Volatile Memory-based Systems 


The use of non-volatile memory to improve performance 
is not new. eNVy [54] designed a non-volatile main 
memory storage system using flash. eNVy, however, ac- 
cessed memory on a page-granularity basis and could not 
distinguish between temporary and permanent data. The 
Rio File Cache [11, 32] used battery-backed DRAM to 
emulate NVBM but it did not account for persistent data 
residing in volatile CPU caches. Recently there have 
been many efforts [21] to optimize data structures for 
flash memory based systems. FD-Tree [31] and Buffer- 
Hash [2] are examples of write-optimized data structures 
designed to overcome high-latency of random writes, 
while FAWN [3] presents an energy efficient system de- 
sign for clusters using flash memory. However, design 
choices that have been influenced by flash limitations 
(e.g., block addressing and high-latency random writes) 
render these systems suboptimal for NVBM. 

Qureshi et al. [39] have also investigated combining 
PCM and DRAM into a hybrid main-memory system 
but do not use the non-volatile features of PCM. While 
our work assumes that NVBM wear-leveling happens 
at a lower layer [55], it is worth noting that versioning 
can help wear-leveling as frequently written locations are 
aged out and replaced by new versions. Most closely re- 
lated is the work on NVTM [12] and BPFS [14]. NVTM, 
a more general system than CDDS, adds STM-based [44]
durability to non-volatile memory. However, it requires 
adoption of an STM-based programming model. Fur- 
ther, because NVTM only uses a metadata log, it cannot 
guarantee failure atomicity. BPFS, a PCM-based file sys- 
tem, also proposes a single-level store. However, unlike 
CDDS’s exclusive use of existing processor primitives, 
BPFS depends on extensive hardware modifications to 
provide correctness and durability. Further, unlike the 
data structure interface proposed in this work, BPFS im- 
plements a file system interface. While this is transparent 
to legacy applications, the system-call overheads reduce 
NVBM’s low-latency benefits. 


2.4 Data Store Trends 


The growth of “big data” [1] and the corresponding need 
for scalable analytics has driven the creation of a num- 
ber of different data stores today. Best exemplified by 
NoSQL systems [9], the throughput and latency require- 
ments of large web services, social networks, and social 
media-based applications have been driving the design 
of next-generation data stores. In terms of storage, high- 
performance systems have started shifting from mag- 
netic disks to flash over the last decade. Even more 
recently, this shift has accelerated to the use of large 
memory-backed data stores. Examples of the latter in- 
clude memcached [18] clusters over 200 TB in size [28], 
memory-backed systems such as RAMCloud [38], in- 
memory databases [47, 52], and NoSQL systems such 
as Redis [41]. As DRAM is volatile, these systems pro- 
vide data durability using backend databases (e.g., mem- 
cached/MySQL), on-disk logs (e.g., RAMCloud), or, for 
systems with relaxed durability semantics, via periodic 
checkpoints. We expect that these systems will easily 
transition from being DRAM-based with separate persis- 
tent storage to being NVBM-based. 


3 Design and Implementation 


As mentioned previously, we expect NVBM to be ex- 
posed across a memory bus and not via a legacy disk 
interface. Using the PCI interface (256 ns latency [24]) 
or even a kernel-based syscall API (89.2 and 76.4 ns for 
POSIX read/write) would add significant overhead 
to NVBM’s access latencies (50-150 ns). Further, given 
the performance and energy cost of moving data, we be- 
lieve that all data should reside in a single-level store 
where no distinction is made between volatile and persis- 
tent storage and all updates are performed in-place. We 
therefore propose that data access should use userspace 
libraries and APIs that map data into the process’s ad- 
dress space. 

However, the same properties that allow systems to 
take full advantage of NVBM’s performance proper- 
ties also introduce challenges. In particular, one of the 
biggest obstacles is that current processors do not pro- 
vide primitives to order memory writes. Combined with 
the fact that the memory controller can reorder writes (at 
a cache line granularity), current mechanisms for updating
data structures are likely to cause corruption in the
face of power or software failures. For example, assume 
that a hash table insert requires the write of a new hash 
table object and is followed by a pointer write linking 
the new object to the hash table. A reordered write could 
propagate the pointer to main memory before the object 
and a failure at this stage would cause the pointer to link 
to an undefined memory region. Processor modifications 
for ordering can be complex [14], do not show up on 
vendor roadmaps, and will likely be preceded by NVBM 
availability. 

To address these issues, our design and implemen- 
tation focuses on three different layers. First, in Sec- 
tion 3.1, we describe how we implement ordering and 
flushing of data on existing processors. However, this 
low-level primitive is not sufficient for atomic updates 
larger than 8 bytes. We therefore also require
versioning CDDSs, whose design principles are
described in Section 3.2. After discussing our CDDS B- 
Tree implementation in Section 3.3 and some of the open 
opportunities and challenges with CDDS data structures 
in Section 3.4, Section 3.5 describes Tembo, the system 
resulting from the integration of our CDDS B-Tree into 
a distributed Key-Value system.


3.1 Flushing Data on Current Processors 


As mentioned earlier, today’s processors have no mecha- 
nism for preventing memory writes from reaching mem- 
ory and doing so for arbitrarily large updates would be 
infeasible. Similarly, there is no guarantee that writes 
will not be reordered by either the processor or by the 
memory controller. While processors support an mfence
instruction, it only provides write visibility and does not 
guarantee that all memory writes are propagated to mem- 
ory (NVBM in this case) or that the ordering of writes is 
maintained. While cache contents can be flushed using 
the wbinvd instruction, it is a high-overhead operation 
(multiple ms per invocation) and flushes the instruction 
cache and other unrelated cached data. While it is pos- 
sible to mark specific memory regions as write-through, 
this impacts write throughput as all stores have to wait 
for the data to reach main memory. 

To address this problem, we use a combination of
tracking recently written data and use of the mfence
and clflush instructions. clflush is an instruction
that invalidates the cache line containing a given memory
address from all levels of the cache hierarchy, across
multiple processors. If the cache line is dirty (i.e., it has
uncommitted data), it is written to memory before
invalidation. The clflush instruction is also ordered by the
mfence instruction. Therefore, to commit a series of
memory writes, we first execute an mfence as a barrier
to them, execute a clflush on every cache line of all
modified memory regions that need to be committed to
persistent memory, and then execute another mfence.
In this paper, we refer to this instruction sequence as a
flush. As microbenchmarks in Section 4.2 show, using
flush will be acceptable for most workloads.

While this description and tracking dirty memory 
might seem complex, this was easy to implement in prac- 
tice and can be abstracted away by macros or helper 
functions. In particular, for data structures, all up- 
dates occur behind an API and therefore the process of 
flushing data to non-volatile memory is hidden from 
the programmer. Using the simplified hash table example 
described above, the implementation would first write 
the object and flush it. Only after this would it write 
the pointer value and then flush again. This two-step 
process is transparent to the user as it occurs inside the 
insert method. 
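
The two-step insert described above can be illustrated with a small simulation. This is a minimal sketch and not the paper's implementation: the SimulatedNVM class, its evict method, and the string addresses are hypothetical stand-ins for a write-back cache in front of NVBM, and flush here merely models the mfence/clflush/mfence sequence rather than issuing those instructions.

```python
class SimulatedNVM:
    """Toy model of a write-back cache in front of non-volatile memory."""

    def __init__(self):
        self.cache = {}    # dirty lines: address -> value, lost on a crash
        self.durable = {}  # NVBM contents: address -> value, survive a crash

    def store(self, addr, value):
        # A store only reaches the cache; nothing is durable yet.
        self.cache[addr] = value

    def evict(self, addr):
        # The hardware may write back any dirty line whenever it likes.
        if addr in self.cache:
            self.durable[addr] = self.cache.pop(addr)

    def flush(self, *addrs):
        # Stand-in for: mfence; clflush on each address; mfence.
        for addr in addrs:
            self.evict(addr)

    def crash(self):
        # Power failure: every line that was still dirty is lost.
        self.cache.clear()

    def load(self, addr):
        return self.cache.get(addr, self.durable.get(addr))


def insert(nvm, bucket_addr, obj_addr, obj):
    """Two-step insert: make the object durable before the pointer to it."""
    nvm.store(obj_addr, obj)
    nvm.flush(obj_addr)
    nvm.store(bucket_addr, obj_addr)
    nvm.flush(bucket_addr)


def has_dangling_pointer(nvm, bucket_addr):
    """True if the durable pointer names an object that never became durable."""
    ptr = nvm.durable.get(bucket_addr)
    return ptr is not None and ptr not in nvm.durable
```

Without the intermediate flush, the model allows the pointer line to be evicted before the object line, which reproduces the corruption scenario from the beginning of Section 3; with it, a crash at any point leaves either no pointer or a pointer to a durable object.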

Finally, one should note that while flush is neces- 
sary for durability and consistency, it is not sufficient by 
itself. If any metadata update (e.g., rebalancing a tree) 
requires an atomic update greater than the 8 byte atomic 
write provided by the hardware, a failure could leave it 
in an inconsistent state. We therefore need the versioning 
approach described below in Sections 3.2 and 3.3. 


3.2 CDDS Overview 


Given the challenges highlighted at the beginning of Sec- 
tion 3, an ideal data store for non-volatile memory must 
have the following properties: 


• Durable: The data store should be durable. A fail-stop
  failure should not lose committed data.

• Consistent: The data store should remain consistent
  after every update operation. If a failure occurs
  during an update, the data store must be restored to
  a consistent state before further updates are applied.

• Scalable: The data store should scale to arbitrarily-large
  sizes. When compared to traditional data stores, any
  space, performance, or complexity overhead should be
  minimal.

• Easy-to-Program: Using the data store should not
  introduce undue complexity for programmers or
  unreasonable limitations to its use.


We believe it is possible to meet the above properties 
by storing data in Consistent and Durable Data Struc- 
tures (CDDSs), i.e., hardened versions of conventional 
data structures currently used with volatile memory. The 
ideas used in constructing a CDDS are applicable to a 
wide variety of linked data structures and, in this paper, 
we implement a CDDS B-Tree because of its non-trivial 
implementation complexity and widespread use in stor- 
age systems. We would like to note that the design and 
implementation of a CDDS only addresses physical consistency,
i.e., ensuring that the data structure is readable
and never left in a corrupt state. Higher-level layers control
logical consistency, i.e., ensuring that the data stored
in the data structure is valid and matches external integrity
constraints. Similarly, while our current system
implements a simple concurrency control scheme, we do
not mandate concurrency control to provide isolation as
it might be more efficient to do it at a higher layer.


[Figure: each entry shows a key above its [start version,
end version) range; the legend distinguishes live entries
from dead entries.]

Figure 1: Example of a CDDS B-Tree

A CDDS is built by maintaining a limited number of 
versions of the data structure with the constraint that an 
update should not weaken the structural integrity of an 
older version and that updates are atomic. This version- 
ing scheme allows a CDDS to provide consistency with- 
out the additional overhead of logging or shadowing. A 
CDDS thus provides a guarantee that a failure between 
operations will never leave the data in an inconsistent 
state. As a CDDS never acknowledges completion of 
an update without safely committing it to non-volatile 
memory, it also ensures that there is no silent data loss. 


3.2.1 Versioning for Durability 


Internally, a CDDS maintains the following properties: 


• There exists a version number for the most recent
  consistent version. This is used by any thread which
  wishes to read from the data structure.

• Every update to the data structure results in the
  creation of a new version.

• During the update operation, modifications ensure
  that existing data representing older versions are
  never overwritten. Such modifications are performed
  by either using atomic operations or copy-on-write
  style changes.

• After all the modifications for an update have been
  made persistent, the most recent consistent version
  number is updated atomically.
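
The four properties above can be sketched on a deliberately simple structure. This is an illustrative model under stated assumptions, not the paper's code: VersionedStore, Entry, and the LIVE sentinel (an end version of 0 meaning "no end version") are hypothetical names, the store is a flat list rather than a B-Tree, and the points where a real CDDS would flush to NVBM are only marked by comments.

```python
from dataclasses import dataclass

LIVE = 0  # assumed sentinel: an end version of 0 means "no end version yet"


@dataclass
class Entry:
    key: int
    value: str
    start: int
    end: int = LIVE


class VersionedStore:
    def __init__(self):
        self.entries = []         # old entries are ended, never overwritten
        self.current_version = 0  # most recent consistent version

    @staticmethod
    def visible(entry, v):
        # An entry belongs to version v if it started at or before v
        # and has not ended at or before v.
        return entry.start <= v and (entry.end == LIVE or entry.end > v)

    def get(self, key):
        v = self.current_version  # readers use the committed version
        for entry in self.entries:
            if entry.key == key and self.visible(entry, v):
                return entry.value
        return None

    def put(self, key, value):
        new_v = self.current_version + 1  # every update creates a version
        for entry in self.entries:
            if entry.key == key and entry.end == LIVE:
                entry.end = new_v  # single-field update; old data untouched
        self.entries.append(Entry(key, value, start=new_v))
        # A real CDDS would flush the modified entries to NVBM here ...
        self.current_version = new_v  # ... and only then commit atomically
```

Until the final assignment in put, a reader at the old version still sees the old entry and ignores the new one, which is why a failure before the commit point leaves the previous version intact.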


3.2.2 Garbage Collection 


Along with support for multiple versions, a CDDS also 
tracks versions of the data structure that are being ac- 
cessed. Knowing the oldest version which has a non-zero 
reference count has two benefits. First, we can garbage 
collect older versions of the data structure. Garbage col- 
lection (GC) is run in the background and helps limit the 


space utilization by eliminating data that will not be ref- 
erenced in the future. Second, knowing the oldest active 
version can also improve performance by enabling in- 
telligent space reuse in a CDDS. When creating a new 
entry, the CDDS can proactively reclaim the space used 
by older inactive versions. 
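
A minimal sketch of the bookkeeping described above, assuming entries are (key, start, end) tuples with an end version of 0 for live entries. VersionTracker and collect are hypothetical helpers introduced for illustration; the paper does not specify this interface.

```python
from collections import Counter


class VersionTracker:
    """Counts readers per version so the oldest active version is known."""

    def __init__(self):
        self.refs = Counter()

    def acquire(self, version):
        self.refs[version] += 1

    def release(self, version):
        self.refs[version] -= 1
        if self.refs[version] == 0:
            del self.refs[version]

    def oldest_active(self, current_version):
        # With no readers, nothing older than the current version is needed.
        return min(self.refs, default=current_version)


def collect(entries, oldest_active):
    """Keep entries that some version >= oldest_active can still see.

    A dead entry with end version e is visible only to versions < e,
    so it can be reclaimed once e <= oldest_active.
    """
    return [(k, s, e) for (k, s, e) in entries if e == 0 or e > oldest_active]
```

The same test (end version at or below the oldest active version) is what would let an insert reuse a dead slot in place instead of waiting for the background collector.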


3.2.3 Failure Recovery 


Insert or delete operations may be interrupted due to 
operating system crashes or power failures. By defini- 
tion, the most recent consistent version of the data struc- 
ture should be accessible on recovery. However, an in- 
progress update needs to be removed as it belongs to an 
uncommitted version. We handle failures in a CDDS 
by using a ‘forward garbage collection’ procedure dur- 
ing recovery. This process involves discarding all up- 
date operations which were executed after the most re- 
cent consistent version. New entries created can be dis- 
carded while older entries with in-progress update oper- 
ations are reverted. 
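
The 'forward garbage collection' step can be sketched as a single pass over the entries. This is an assumed, simplified model: entries are dictionaries with key, start, and end fields, an end version of 0 marks a live entry, and the recover function is a hypothetical name rather than the paper's API.

```python
def recover(entries, current_version):
    """Return the entries as they stood at the last consistent version."""
    recovered = []
    for entry in entries:
        if entry["start"] > current_version:
            # Created by the interrupted update: discard it.
            continue
        if entry["end"] > current_version:
            # Ended by the interrupted update: revert it to live.
            entry = {**entry, "end": 0}
        recovered.append(entry)
    return recovered
```

For example, with a last consistent version of 9, an entry that was ended at version 10 becomes live again and an entry that started at version 10 is dropped, while entries that died at or before version 9 are left alone.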


3.3 CDDS B-Trees


As an example of a CDDS, we selected the B-Tree [13] 
data structure because of its widespread use in databases, 
file systems, and storage systems. This section dis- 
cusses the design and implementation of a consistent and 
durable version of a B-Tree. Our B-Tree modifications²
have been heavily inspired by previous work on multi-version
data structures [4, 50]. However, our focus on
durability required changes to the design and impacted
our implementation. We also do not retain all previous
versions of the data structure and can therefore optimize
updates.

In a CDDS B-Tree node, shown in Figure 1, the key
and value stored in a B-Tree entry are augmented with a
start and end version number, represented by unsigned
64-bit integers. A B-Tree node is considered ‘live’ if it
has at least one live entry. In turn, an entry is considered
‘live’ if it does not have an end version (displayed as a
‘-’ in the figure). To bound space utilization, in addition
to ensuring that a minimum number of entries in a B-Tree
node are used, we also bound the minimum number of
live entries in each node. Thus, while the CDDS B-Tree
API is identical to normal B-Trees, the implementation
differs significantly. In the rest of this section, we use the
lookup, insert, and delete operations to illustrate how the
CDDS B-Tree design guarantees consistency and durability³.


²In reality, our B-Tree is a B+ Tree with values only stored in leaves.


Algorithm 1: CDDS B-Tree Lookup
Input: k: key, r: root
Output: val: value

 1 begin lookup (k, r)
 2     v ← current_version
 3     n ← r
 4     while is_inner_node (n) do
 5         entry_num ← find (k, n, v)
 6         n ← n[entry_num].child
 7     entry_num ← find (k, n, v)
 8     return n[entry_num].value
 9 end
10 begin find (k, n, v)
11     l ← 0
12     h ← get_num_entries (n)
13     while l < h do                    // Binary Search
14         m ← (l + h)/2
15         if k < n[m].key then
16             h ← m − 1
17         else l ← m + 1
18     while h < get_num_entries (n) do
19         if n[h].start ≤ v then
20             if n[h].end > v || n[h].end = 0 then
21                 break
22         h ← h + 1
23     return h
24 end


3.3.1 Lookup 


We first briefly describe the lookup algorithm, shown in 
Algorithm 1. For ease of explanation and brevity, the 
pseudocode in this and following algorithms does not in- 
clude all of the design details. The algorithm uses the 
find function to recurse down the tree (lines 4-6) until 
it finds the leaf node with the correct key and value. 

Consider a lookup for the key 10 in the CDDS B-Tree 
shown in Figure 1. After determining the most current 
version (version 9, line 2), we start from the root node 
and pick the rightmost entry with key 99 as it is the next 
largest valid key. Similarly in the next level, we follow 
the link from the leftmost entry and finally retrieve the 
value for 10 from the leaf node. 
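
The lookup described above can be sketched in runnable form. This is a simplified illustration rather than the paper's implementation: the Entry and Node classes, the visible helper, and the convention that an inner entry's key is the largest key in its child are assumptions, and the binary search is the standard lower-bound variant rather than a transcription of Algorithm 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    key: int
    start: int
    end: int = 0                    # assumed: 0 means the entry is live
    value: Optional[str] = None     # set in leaf entries
    child: Optional["Node"] = None  # set in inner entries


@dataclass
class Node:
    entries: List[Entry]            # sorted by key, then by start version
    leaf: bool


def visible(entry, v):
    return entry.start <= v and (entry.end == 0 or entry.end > v)


def find(k, node, v):
    """Index of the first entry with key >= k that is valid for version v."""
    lo, hi = 0, len(node.entries)
    while lo < hi:                  # binary search on the key
        mid = (lo + hi) // 2
        if node.entries[mid].key < k:
            lo = mid + 1
        else:
            hi = mid
    while lo < len(node.entries) and not visible(node.entries[lo], v):
        lo += 1                     # skip entries not valid for version v
    return lo


def lookup(k, root, v):
    node = root
    while not node.leaf:
        i = find(k, node, v)
        if i == len(node.entries):
            return None
        node = node.entries[i].child
    i = find(k, node, v)
    if i < len(node.entries) and node.entries[i].key == k:
        return node.entries[i].value
    return None
```

With a hypothetical two-leaf tree in which key 15 has the version range [6, 8), lookup(15, root, 7) finds the value, while the same lookup at version 9 skips the dead entry and returns None.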

Our implementation currently optimizes lookup performance
by ordering node entries by key first and
then by the start version number. This involves extra
writes during inserts to shift entries but improves read
performance by enabling a binary search within nodes
(lines 13-17 in find). While we have an alternate
implementation that optimizes writes by not ordering keys
at the cost of higher lookup latencies, we do not use it
as our target workloads are read-intensive. Finally, once
we detect the right index in the node, we ensure that we
are returning a version that was valid for v, the requested
version number (lines 18-22).


³A longer technical report [51] presents more details on all CDDS
B-Tree operations and their corresponding implementations.


Algorithm 2: CDDS B-Tree Insertion
Input: k: key, r: root

 1 begin insert_key (k, r)
 2     v ← current_version
 3     v' ← v + 1
       // Recurse to leaf node (n)
 4     y ← get_num_entries (n)
 5     if y = node_size then             // Node Full
 6         if entry_num = can_reuse_version (n, y) then
 7             n[entry_num].key ← k
 8             n[entry_num].start ← v'
 9             n[entry_num].end ← 0
10             flush (n[entry_num])
11         else
12             split_insert (n, k, v')
           // Update inner nodes
13     else
14         n[y].key ← k
15         n[y].start ← v'
16         n[y].end ← 0
17         flush (n[y])
18     current_version ← v'
19     flush (current_version)
20 end
21 begin split_insert (n, k, v)
22     l ← num_live_entries (n)
23     m_l ← min_live_entries
24     if l > 4 m_l then
25         nn1 ← new_node
26         nn2 ← new_node
27         for i = 1 to l/2 do
28             insert (nn1, n[i].key, v)
29         for i = l/2 + 1 to l do
30             insert (nn2, n[i].key, v)
31         if k < n[l/2].key then
32             insert (nn1, k, v)
33         else insert (nn2, k, v)
34         flush (nn1, nn2)
35     else
36         nn ← new_node
37         for i = 1 to l do
38             insert (nn, n[i].key, v)
39         insert (nn, k, v)
40         flush (nn)
41     for i = 1 to l do
42         n[i].end ← v
43     flush (n)
44 end


3.3.2 Insertion


The algorithm for inserting a key into a CDDS B-Tree 
is shown in Algorithm 2. Our implementation of the 
algorithm uses the flush operation (described in Sec- 
tion 3.1) to perform atomic operations on a cacheline. 
Consider the case where a key, 12, is inserted into the 
B-Tree shown in Figure 1. First, an algorithm similar 
to lookup is used to find the leaf node that contains the 
key range that 12 belongs to. In this case, the right-most 
leaf node is selected. As shown in lines 2-3, the cur- 
rent consistent version is read and a new version number 
is generated. As the leaf node is full, we first use the 
can_reuse_version function to check if an existing 
dead entry can be reused. In this case, the entry with key 
15 died at version 8 and is reused. To reuse a slot we 
first remove the key from the node and shift the entries 
to maintain them in sorted order. Now we insert the new 
key and again shift entries as required. For each key shift, 
we ensure that the data is first flushed to another slot 
before it is overwritten. This ensures that the safety prop- 
erties specified in Section 3.2.1 are not violated. While 
not described in the algorithm, if an empty entry was de- 
tected in the node, it would be used and the order of the 
keys, as specified in Section 3.3.1, would be maintained. 


If no free or dead entry was found, a split_insert, 
similar to a traditional B-Tree split, would be performed. 
split_insert is a copy-on-write style operation in 
which existing entries are copied before making a mod- 
ification. As an example, consider the node shown in 
Figure 2, where the key 40 is being inserted. We only 
need to preserve the ‘live’ entries for further updates and 
split_insert creates one or two new nodes based on
the number of live entries present. Note that setting the 
end version (lines 41-42) is the only change made to the 
existing leaf node. This ensures that older data versions 
are not affected by failures. In this case, two new nodes 
are created at the end of the split. 
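
The copy-on-write split described above can be sketched as follows. This is an illustrative model, not the paper's code: the Entry class, the LIVE sentinel, and the list-of-entries node representation are assumptions, min_live is assumed to be at least 1, and the flush calls are only indicated by a comment.

```python
from dataclasses import dataclass

LIVE = 0  # assumed sentinel for "no end version"


@dataclass
class Entry:
    key: int
    start: int
    end: int = LIVE


def split_insert(node, k, v, min_live):
    """Copy the live entries of `node` plus key `k` into one or two new nodes.

    `node` is a list of Entry sorted by key. The only change made to the
    old node is setting the end version of its live entries.
    """
    live = [e for e in node if e.end == LIVE]
    if len(live) > 4 * min_live:
        half = len(live) // 2
        left = [e.key for e in live[:half]]
        right = [e.key for e in live[half:]]
        # The new key joins whichever half covers its range.
        (left if k < live[half - 1].key else right).append(k)
        key_groups = [sorted(left), sorted(right)]
    else:
        key_groups = [sorted([e.key for e in live] + [k])]
    new_nodes = [[Entry(key, v) for key in keys] for keys in key_groups]
    # A real implementation would flush the new nodes to NVBM here,
    # before touching the old node.
    for e in live:
        e.end = v  # the old entries die at the new version
    return new_nodes
```

Because the old node keeps every key and start version, a reader at an earlier version, or a recovery pass after a crash, can still interpret it; dead entries are simply not copied forward.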


The inner nodes are now updated with links to the 
newly created leaf nodes and the parent entries of the 
now-dead nodes are also marked as dead. A similar 
procedure is followed for inserting entries into the inner 
nodes. When the root node of a tree overflows, we split 
the node using the split_insert function and create 
one or two new nodes. We then create a new root node 
with links to the old root and to the newly created split- 
nodes. The pointer to the root node is updated atomically 
to ensure safety. 


Once all the changes have been flushed to persistent
storage, the current consistent version is updated
atomically (lines 18-19). At this point, the update has been
successfully committed to the NVBM and failures will
not result in the update being lost.


Before:  [ 5 [2,-) | 20 [3,-) | 99 [1,-) ]      Insert 40 [4,-)

After:   [ 5 [2,4) | 20 [3,4) | 99 [1,4) ]      (old node)
         [ 5 [4,-) | 20 [4,-) ]  [ 40 [4,-) | 99 [4,-) ]      (new nodes)


Figure 2: CDDS node split during insertion


Algorithm 3: CDDS B-Tree Deletion
Input: k: key, r: root

 1 begin delete (k, r)
 2     v ← current_version
 3     v' ← v + 1
       // Recurse to leaf node (n)
 4     y ← find_entry (n, k)
 5     n[y].end ← v'
 6     l ← num_live_entries (n)
 7     if l = m_l then                   // Underflow
 8         s ← pick_sibling (n)
 9         l_s ← num_live_entries (s)
10         if l_s > 3 × m_l then
11             copy_from_sibling (n, s, v')
12         else merge_with_sibling (n, s, v')
           // Update inner nodes
13     else flush (n[y])
14     current_version ← v'
15     flush (current_version)
16 end
17 begin merge_with_sibling (n, s, v)
18     y ← get_num_entries (s)
19     if y < 4 × m_l then
20         for i = 1 to m_l do
21             insert (s, n[i].key, v)
22             n[i].end ← v
23     else
24         nn ← new_node
25         l_s ← num_live_entries (s)
26         for i = 1 to l_s do
27             insert (nn, s[i].key, v)
28             s[i].end ← v
29         for i = 1 to m_l do
30             insert (nn, n[i].key, v)
31             n[i].end ← v
32         flush (nn)
33     flush (n, s)
34 end
35 begin copy_from_sibling (n, s, v)
       // Omitted for brevity
36 end





3.3.3 Deletion 


Deleting an entry is conceptually simple as it simply in- 
volves setting the end version number for the given key. 
It does not require deleting any data as that is handled 
by GC. However, in order to bound the number of live 
blocks in the B-Tree and improve space utilization, we 
shift live entries if the number of live entries per node 
reaches $m_l$, a threshold defined in Section 3.3.6. The only
exception is the root node as, due to a lack of siblings, 
shifting within the same level is not feasible. However,
as described in Section 3.3.4, if the root only contains
one live entry, the child will be promoted.


Figure 3: CDDS node merge during deletion

Figure 4: CDDS B-Tree after Garbage Collection


FAST 11: 9th USENIX Conference on File and Storage Technologies

As shown in Algorithm 3, we first check if the sibling
has at least $3 \times m_l$ live entries and, if so, we copy
$m_l$ live entries from the sibling to form a new node. As
the leaf has $m_l$ live entries, the new node will have
$2 \times m_l$ live entries. If that is not the case, we check
if the sibling has enough space to copy the live entries.
Otherwise, as shown in Figure 3, we merge the two nodes to
create a new node containing the live entries from the leaf
and sibling nodes. The number of live entries in the new node
will be $> 2 \times m_l$. The inner nodes are updated with
pointers to the newly created nodes and, after the changes have
been flushed to persistent memory, the current consistent
version is incremented.


3.3.4 Garbage Collection 


As shown in Section 3.3.3, the size of the B-Tree does 
not decrease when keys are deleted and can increase 
due to the creation of new nodes. To reduce the space 
overhead, we therefore use a periodic GC procedure, 
currently implemented using a mark-and-sweep garbage 
collector [8]. The GC procedure first selects the latest 
version number that can be safely garbage collected. It 
then starts from the root of the B-Tree and deletes nodes 
which contain dead and unreferenced entries by inval- 
idating the parent pointer to the deleted node. If the 
root node contains only one live entry after garbage col- 
lection, the child pointed to by the entry is promoted. 
This helps reduce the height of the B-Tree. As seen in 
the transformation of Figure 1 to the reduced-height tree
shown in Figure 4, only live nodes are present after GC. 


3.3.5 Failure Recovery 


The recovery procedure for the B-Tree is similar to
garbage collection. In this case, nodes newer than the
most recent consistent version are removed and older
nodes are recursively analyzed for partial updates. The
recovery function performs a physical ‘undo’ of these
updates and ensures that the tree is physically and
logically identical to the most recent consistent version.
While our current recovery implementation scans the entire
data structure, the recovery process is fast as it operates
at memory bandwidth speeds and only needs to
verify CDDS metadata.


3.3.6 Space Analysis 


In the CDDS B-Tree, space utilization can be characterized
by the number of live blocks required to store $N$
key-value pairs. Since the values are only stored in the
leaf nodes, we analyze the maximum number of live leaf
nodes present in the tree. In the CDDS B-Tree, a new
node is created by an insert or delete operation. As
described in Sections 3.3.2 and 3.3.3, the minimum number
of live entries in new nodes is $2 \times m_l$.

When the number of live entries in a node reaches $m_l$,
it is either merged with a sibling node or its live entries
are copied to a new node. Hence, the number of live
entries in a node is $> m_l$. Therefore, in a B-Tree with
$N$ live keys, the maximum number of live leaf nodes is
bound by $O(N/m_l)$. Choosing $m_l$ as $k/5$, where $k$ is
the size of a B-Tree node, the maximum number of live leaf
nodes is $O(5N/k)$.

For each live leaf node, there is a corresponding entry
in the parent node. Since the number of live entries
in an inner node is also $> m_l$, the number of parent
nodes required is
$O\left(\frac{5N/k}{m_l}\right) = O\left(\frac{N}{(k/5)^2}\right)$.
Extending this, we can see that the height of the CDDS B-Tree
is bound by $O(\log_{k/5} N)$. This also bounds the time for
all B-Tree operations.
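
As a worked illustration of the bound above, the following sketch counts an upper bound on the live nodes at each level when every live node holds at least $m_l$ live entries. The function name and the sample values are hypothetical, and $m_l$ is passed in as a parameter rather than derived from a node size.

```python
import math


def max_live_nodes_per_level(n_keys, m_l):
    """Upper bounds on live nodes per level, leaves first.

    Assumes every live node holds at least m_l live entries, so each
    level has at most ceil(previous / m_l) nodes.
    """
    if m_l < 2:
        raise ValueError("m_l must be at least 2 for the tree to shrink")
    levels = []
    nodes = n_keys
    while nodes > 1:
        nodes = math.ceil(nodes / m_l)
        levels.append(nodes)
    return levels
```

For one million live keys and an illustrative $m_l$ of 20, the per-level bounds shrink by a factor of 20 at each step, so the number of levels grows only logarithmically in the number of keys.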


3.4 CDDS Discussion 


Apart from the CDDS B-Tree operations described 
above, the implementation also supports additional fea- 
tures including iterators and range scans. We believe that 
CDDS versioning also lends itself to other powerful fea- 
tures such as instant snapshots, rollback for programmer 
recovery, and integrated NVBM wear-leveling. We hope 
to explore these issues in our future work. 

We also do not anticipate the design of a CDDS 
preventing the implementation of different concurrency 
schemes. Our current CDDS B-Tree implementation 
uses a multiple-reader, single-writer model. However, 
the use of versioning lends itself to more complex con- 
currency control schemes including multi-version con- 
currency control (MVCC) [6]. While beyond the scope 
of this paper, exploring different concurrency control 
schemes for CDDSs is a part of our future work. 

CDDS-based systems currently depend on virtual
memory mechanisms to provide fault-isolation and, like
other services, depend on the OS for safety. Therefore,
while unlikely, placing NVBM on the memory bus
can expose it to accidental writes from rogue DMAs. 
In contrast, the narrow traditional block device interface 
makes it harder to accidentally corrupt data. We believe 
that hardware memory protection, similar to IOMMUs, 
will be required to address this problem. Given that we 
map data into an application’s address space, stray writes 
from a buggy application could also destroy data. While 
this is no different from current applications that mmap 
their data, we are developing lightweight persistent heaps 
that use virtual memory protection with a RVM-style 
API [43] to provide improved data safety. 

Finally, apart from multi-version data structures [4, 
50], CDDSs have also been influenced by Persistent Data 
Structures (PDSs) [17]. The “Persistent” in PDS does 
not actually denote durability on persistent storage but, 
instead, represents immutable data structures where an 
update always yields a new data structure copy and never 
modifies previous versions. The CDDS B-Tree presented 
above is a weakened form of semi-persistent data struc- 
tures. We modify previous versions of the data struc- 
ture for efficiency but are guaranteed to recover from 
failure and rollback to a consistent state. However, the 
PDS concepts are applicable, in theory, to all linked data 
structures. Using PDS-style techniques, we have imple- 
mented a proof-of-concept CDDS hash table and, as ev- 
idenced by previous work for functional programming 
languages [35], we are confident that CDDS versioning 
techniques can be extended to a wide range of data struc- 
tures. 


3.5 Tembo: A CDDS Key-Value Store


We created Tembo, a CDDS Key-Value (KV) store, to 
evaluate the effectiveness of a CDDS-based data store. 
The system involves the integration of the CDDS-based 
B-Tree described in Section 3.3 into Redis [41], a widely 
used event-driven KV store. As our contribution is not 
based around the design of this KV system, we only 
briefly describe Tembo in this section. As shown in Sec- 
tion 4.4, the integration effort was minor and leads us to 
believe that retrofitting CDDS into existing applications 
will be straightforward. 


The base architecture of Redis is well suited for a 
CDDS as it retains the entire data set in RAM. This also 
allows an unmodified Redis to serve as an appropriate 
performance baseline. While persistence in the original 
system was provided by a write-ahead append-only log, 
this is eliminated in Tembo because of the CDDS B-Tree 
integration. For fault-tolerance, Tembo provides master- 
slave replication with support for hierarchical replication 
trees where a slave can act as the master for other
replicas. Consistent hashing [27] is used by client libraries to
distribute data in a Tembo cluster.


4 Evaluation 


In this section, we evaluate our design choices in build- 
ing Consistent and Durable Data Structures. First, we 
measure the overhead associated with techniques used to 
achieve durability on existing processors. We then com- 
pare the CDDS B-tree to Berkeley DB and against log- 
based schemes. After briefly discussing CDDS imple- 
mentation and integration complexity, we present results 
from a multi-node distributed experiment where we use 
the Yahoo Cloud Serving Benchmark (YCSB) [15]. 


4.1 Evaluation Setup 


As NVBM is not commercially available yet, we used 
DRAM-based servers. While others [14] have shown 
that DRAM-based results are a good predictor of NVBM 
performance, as a part of our ongoing work, we aim 
to run micro-architectural simulations to confirm this 
within the context of our work. Our testbed consisted 
of 15 servers with two Intel Xeon Quad-Core 2.67 GHz 
(X5550) processors and 48 GB RAM each. The ma- 
chines were connected via a full-bisection Gigabit Eth- 
ernet network. Each processor has 128 KB L1, 256 KB 
L2, and 8 MB L3 caches. While each server contained 
8 300 GB 10K SAS drives, unless specified, all experi- 
ments were run directly on RAM or on a ramdisk. We 
used the Ubuntu 10.04 Linux distribution and the 2.6.32- 
24 64-bit kernel. 


4.2 Flush Performance 


To accurately capture the performance of the flush 
operation defined in Section 3.1, we used the “Mult- 
CallFlushLRU” methodology [53]. The experiment al- 
locates 64 MB of memory and subdivides it into equally- 
sized cache-aligned objects. Object sizes ranged from 
64 bytes to 64 MB. We write to every cache line in an 
object, flush the entire object, and then repeat the pro- 
cess with the next object. For improved timing accuracy, 
we stride over the memory region multiple times. 

Remembering that each flush is a number of 
clflushes bracketed by mfences on both sides, Fig- 
ure 5 shows the number of clflushes executed per sec- 
ond. Flushing small objects sees the worst performance 
(~12M cacheline flushes/sec for 64 byte objects). For 
larger objects (256 bytes to 8 MB), the performance ranges 
from ~16M to 20M cacheline flushes/sec. 

We also observed an unexpected drop in performance 
for large objects (>8 MB). Our analysis showed that 
this was due to the cache coherency protocol. Large 
objects are likely to be evicted from the L3 cache be- 
fore they are explicitly flushed. A subsequent clflush 
would miss in the local cache and cause a high-latency 
“snoop” request that checks the second off-socket pro- 
cessor for the given cache line. As measured by the 
UNC_SNP_RESP_TO_REMOTE_HOME.I_STATE per- 
formance counter, seen in Figure 6, the second socket 
shows a corresponding spike in requests for cache lines 
that it does not contain. To verify this, we physically re- 
moved a processor and observed that the anomaly disap- 
peared⁴. Further, as we could not replicate this slowdown 
on AMD platforms, we believe that cache-coherency 
protocol modifications can address this anomaly. 

Overall, the results show that we can flush 0.72- 
1.19 GB/s on current processors. For applications with- 
out networking, Section 4.3 shows that future hardware 
support can help but applications using flush can still 
outperform applications that use file system sync calls. 
Distributed applications are more likely to encounter net- 
work bottlenecks before flush becomes an overhead. 
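
The bandwidth range quoted above can be related to the cacheline flush rates reported for Figure 5. The short check below is a sketch under two assumptions that the text does not state explicitly: a 64-byte cache line, and "GB" meaning 2^30 bytes.

```python
CACHE_LINE_BYTES = 64          # assumed cache line size
GIB = 2 ** 30                  # assumed meaning of "GB" in the text


def flush_bandwidth_gib(flushes_per_sec: float) -> float:
    """Convert a cache line flush rate into GiB flushed per second."""
    return flushes_per_sec * CACHE_LINE_BYTES / GIB


low = flush_bandwidth_gib(12e6)    # ~12M flushes/sec (64 byte objects)
high = flush_bandwidth_gib(20e6)   # ~20M flushes/sec (upper end of the range)
print(f"{low:.2f} to {high:.2f} GiB/s")
```

Under these assumptions the two endpoints round to 0.72 and 1.19, which agrees with the range given in the text.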


⁴We did not have physical access to the experimental testbed and 
ran the processor removal experiment on a different dual-socket Intel 
Xeon (X5570) machine. 


                         Lines of Code 
Original STX B-Tree      2,110 
CDDS Modifications       1,902 
Redis (v2.0.0-rc4)       18,539 
Tembo Modifications      321 

Table 2: Lines of Code Modified 


4.3 API Microbenchmarks 


This section compares the CDDS B-Tree performance 
for puts, gets, and deletes to Berkeley DB’s (BDB) B- 
Tree implementation [36]. For this experiment, we in- 
sert, fetch, and then delete 1 million key-value tuples 
into each system. After each operation, we flush the 
CPU cache to eliminate any variance due to cache con- 
tents. Keys and values are 25 and 2048 bytes large. The 
single-threaded benchmark driver runs in the same ad- 
dress space as BDB and CDDS. BDB’s cache size was 
set to 8 GB and could hold the entire data set in memory. 
Further, we configure BDB to maintain its log files on an 
in-memory partition. 

We run both CDDS and BDB (v4.8) in durable and 
volatile modes. For BDB volatile mode, we turn transac- 
tions and logging off. For CDDS volatile mode, we turn 
flushing off. Both systems in volatile mode can lose 
or corrupt data and would not be used where durability is 
required. We only present the volatile results to highlight 
predicted performance if hardware support was available 
and to discuss CDDS design tradeoffs. 

The results, displayed in Figure 7, show that, compared 
to memory-backed BDB in durable mode, the CDDS B- 
Tree improves throughput by 74%, 138%, and 503% for 
puts, gets, and deletes respectively. These gains come 
from not using a log (extra writes) or the file system in- 
terface (system call overhead). CDDS delete improve- 
ment is larger than puts and gets because we do not delete 
data immediately but simply mark it as dead and use GC 
to free unreferenced memory. In results not presented 
here, reducing the value size, and therefore the log size, 
improves BDB performance but CDDS always performs 
better. 

If zero-overhead epoch-based hardware support [14] 
was available, the CDDS volatile numbers show that per- 
formance of puts and deletes would increase by 80% and 
27% as flushes would never be on the critical path. We 
do not observe any significant change for gets as the only 
difference between the volatile and durable CDDS is that 
the flush operations are converted into a noop. 

We also notice that while volatile BDB throughput is 
lower than durable CDDS for gets and dels by 52% and 
41%, it is higher by 56% for puts. Puts are slower for the 
CDDS B-Tree because of the work required to maintain 
key ordering (described in Section 3.3.1), GC overhead, 
and a slightly higher height due to nodes with a mixture 
of live and dead entries. Volatile BDB throughput is also 
higher than durable BDB but lower than volatile CDDS 
for all operations. 

Finally, to measure versioning overhead, we compared 
the volatile CDDS B-Tree to a normal B-Tree [7]. While 
not presented in Figure 7, volatile CDDS’s performance 
was lower than the in-memory B-Tree by 24%, 13%, and 
39% for puts, gets, and dels. This difference is similar to 
other performance-optimized versioned B-trees [45]. 


4.4 Implementation Effort 


The CDDS B-Tree started with the STX C++ B-Tree [7] 
implementation but, as measured by sloccount and 
shown in Table 2, the addition of versioning and NVBM 
durability replaced 90% of the code. While the API 
remained the same, the internal implementation differs 
substantially. The integration with Redis to create Tembo 
was simpler and only changed 1.7% of code and took 
less than a day to integrate. Since the CDDS B-Tree 
implements an interface similar to an STL Sorted Con- 
tainer, we believe that integration with other systems 
should also be simple. Overall, our experiences show 
that while the initial implementation complexity is mod- 
erately high, this only needs to be done once for a given 
data structure. The subsequent integration into legacy or 
new systems is straightforward. 
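
The two percentages quoted in this section can be reproduced from the line counts in Table 2. The check below assumes that each percentage is the ratio of modified lines to the size of the original code base, which is our reading of the text rather than a stated definition.

```python
# Line counts as reported in Table 2.
stx_btree_loc = 2110
cdds_modified_loc = 1902
redis_loc = 18539
tembo_modified_loc = 321

# Assumed definition: modified lines divided by original code base size.
cdds_fraction = cdds_modified_loc / stx_btree_loc
tembo_fraction = tembo_modified_loc / redis_loc

print(f"CDDS B-Tree: {cdds_fraction:.0%} of the STX B-Tree code")
print(f"Tembo: {tembo_fraction:.1%} of the Redis code")
```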


4.5 Tembo Versioning vs. Redis Logging 


Apart from the B-Tree specific logging performed by 
BDB in Section 4.3, we also wanted to compare CDDS 
versioning when integrated into Tembo to the write- 
ahead log used by Redis in fully-durable mode. Redis 
uses a hashtable and, as it is hard to compare hashta- 
bles and tree-based data structures, we also replaced the 
hashtable with the STX B-Tree. In this single-node ex- 
periment, we used 6 Tembo or Redis data stores and 2 
clients⁵. The write-ahead log for the Redis server was 
stored on an in-memory partition mounted as tmpfs and 
did not use the hard disk. Each client performed 1M in- 
serts over the loopback interface. 

⁵Being event-driven, both Redis and Tembo are single-threaded. 
Therefore one data store (or client) is run per core in this experiment. 


The results, presented in Figure 8, show that as the 
value size is increased, Tembo performs up to 30% better 
than Redis integrated with the STX B-Tree. While Re- 
dis updates the in-memory data copy and also writes to 
the append-only log, Tembo only updates a single copy. 
While hashtable-based Redis is faster than Tembo for 
256 byte values because of faster lookups, even with the 
disadvantage of a tree-based structure, Tembo’s perfor- 
mance is almost equivalent for 1 KB values and is 15% 
faster for 4 KB values. 

The results presented in this section are lower than the 
improvements in Section 4.3 because of network latency 
overhead. The fsync implementation in tmpfs also 
does not explicitly flush modified cache lines to mem- 
ory and is therefore biased against Tembo. We are work- 
ing on modifications to the file system that will enable a 
fairer comparison. Finally, some of the overhead is due 
to maintaining ordering properties in the CDDS-based 
B-Tree to support range scans - a feature not used in the 
current implementation of Tembo. 


4.6 End-to-End Comparison 


For an end-to-end test, we used YCSB, a framework for 
evaluating the performance of Key-Value, NoSQL, and 
cloud storage systems [15]. In this experiment, we used 
13 servers for the cluster and 2 servers as the clients. 
We extended YCSB to support Tembo, and present re- 
sults from two of YCSB’s workloads. Workload-A, re- 
ferred to as SessionStore in this section, contains a 50:50 
read:update mix and is representative of tracking recent 
actions in an online user’s session. Workload-D, referred 
to as StatusUpdates, has a 95:5 read:insert mix. It rep- 
resents people updating their online status (e.g., Twitter 
tweets or Facebook wall updates) and other users reading 
them. Both workloads execute 2M operations on values 
consisting of 10 columns with 100 byte fields. 
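
To make the two operation mixes concrete, the sketch below draws operations with the stated read:update and read:insert ratios and builds a record of 10 columns with 100 byte fields. It is an illustration only and does not use YCSB's code or API; the function names, the fixed seed, and the filler field contents are assumptions.

```python
import random

# Operation mixes as described in the text; names are labels only.
WORKLOADS = {
    "SessionStore": {"read": 0.50, "update": 0.50},   # YCSB Workload-A
    "StatusUpdates": {"read": 0.95, "insert": 0.05},  # YCSB Workload-D
}


def generate_ops(workload: str, count: int, seed: int = 0):
    """Return a list of operation names drawn from the workload's mix."""
    mix = WORKLOADS[workload]
    rng = random.Random(seed)
    return rng.choices(list(mix), weights=list(mix.values()), k=count)


def make_value(columns: int = 10, field_bytes: int = 100) -> dict:
    """Build one record: 10 columns of 100 byte fields (filler content)."""
    return {f"field{i}": b"x" * field_bytes for i in range(columns)}


ops = generate_ops("StatusUpdates", 100_000)
insert_share = ops.count("insert") / len(ops)
record_bytes = sum(len(v) for v in make_value().values())
```

Each record therefore carries 1000 bytes of field data, and roughly one operation in twenty in the StatusUpdates mix is an insert.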

We compare Tembo to Cassandra (v0.6.1) [29], 
a distributed data store that borrows concepts from 
BigTable [10] and Dynamo [16]. We used three differ- 
ent Cassandra configurations in this experiment. The 
first two used a ramdisk for storage but the first (Cassan- 
dra/Mem/Durable) flushed its commit log before every 
update while the second (Cassandra/Mem/Volatile) only 
flushed the log every 10 seconds. For completeness, we 
also configured Cassandra to use a disk as the backing 
store (Cassandra/Disk/Durable). 

Figure 9 presents the aggregate throughput for the 
SessionStore benchmark. With 30 client threads, 
Tembo’s throughput was 286% higher than memory- 
backed durable Cassandra. Given Tembo and Cas- 
sandra’s different design and implementation choices, 
the experiment shows the overheads of Cassandra’s in- 
memory “memtables,” on-disk “SSTables,” and a write- 
ahead log, vs. Tembo’s single-level store. Disk-backed 
Cassandra’s throughput was only 22-44% lower than the 
memory-backed durable configuration. The large num- 
ber of disks in our experimental setup and a 512 MB 
battery-backed disk controller cache were responsible 
for this better-than-expected disk performance. On a 
different machine with fewer disks and a smaller con- 
troller cache, disk-backed Cassandra bottlenecked with 
10 client threads. 

Figure 10 shows that, for the StatusUpdates workload, 
Tembo’s throughput is up to 250% higher than memory- 
backed durable Cassandra. Tembo’s improvement is 
slightly lower than the SessionStore benchmark because 
StatusUpdates insert operations update all 10 columns 
for each value, while the SessionStore only selects one 
random column to update. Finally, as the entire data set 
can be cached in memory and inserts represent only 5% 
of this workload, the different Cassandra configurations 
have similar performance. 


5 Conclusion and Future Work 


Given the impending shift to non-volatile byte- 
addressable memory, this work has presented Consistent 
and Durable Data Structures (CDDSs), an architecture 
that, without processor modifications, allows for the cre- 
ation of log-less storage systems on NVBM. Our results 
show that redesigning systems to support single-level 
data stores will be critical in meeting the high-throughput 
requirements of emerging applications. 

We are currently also working on extending this work 
in a number of directions. First, we plan on leverag- 
ing the inbuilt CDDS versioning to support multi-version 
concurrency control. We also aim to explore the use of 
relaxed consistency to further optimize performance as 
well as integration with virtual memory to provide bet- 
ter safety against stray application writes. Finally, we 
are investigating the integration of CDDS versioning and 
wear-leveling for better performance. 


Abstract 


Although Flash Memory based Solid State Drives (SSDs) 
exhibit high performance and low power consumption, 
a critical concern is their limited lifespan along with the 
associated reliability issues. In this paper, we propose to 
build a Content-Aware Flash Translation Layer (CAFTL) 
to enhance the endurance of SSDs at the device level. 
With no need of any semantic information from the host, 
CAFTL can effectively reduce write traffic to flash mem- 
ory by removing unnecessary duplicate writes and can 
also substantially extend available free flash memory 
space by coalescing redundant data in SSDs, which fur- 
ther improves the efficiency of garbage collection and 
wear-leveling. In order to retain high data access per- 
formance, we have also designed a set of acceleration 
techniques to reduce the runtime overhead and mini- 
mize the performance impact caused by extra computa- 
tional cost. Our experimental results show that our solu- 
tion can effectively identify up to 86.2% of the duplicate 
writes, which translates to a write traffic reduction of up 
to 24.2% and extends the flash space by up to 
31.2%. Meanwhile, CAFTL incurs only a minimal 
performance overhead of up to 0.5%. 


1 Introduction 


The limited lifespan is the Achilles’ heel of Flash Mem- 
ory based Solid State Drives (SSDs). On one hand, SSDs 
built on semiconductor chips without any moving parts 
have exhibited many unique technical merits compared 
with hard disk drives (HDDs), particularly high random 
access performance and low power consumption. On the 
other hand, the limited lifespan of SSDs, which are built 
on flash memories with limited erase/program cycles, is 
still one of the most critical concerns that seriously hin- 
der a wide deployment of SSDs in reliability-sensitive 
environments, such as data centers [10]. Although SSD 
manufacturers often claim that SSDs can sustain rou- 
tine usage for years, the technical concerns about the en- 
durance issues of SSDs still remain high. This is mainly 
due to three not-so-well-known reasons. First, as bit den- 
sity increases, flash memory chips become more afford- 
able but, at the same time, less reliable and less durable. 
In the last two years, for high-density flash devices, we 
have seen a sharp drop of erase/program cycle ratings 
from ten thousand to five thousand cycles [7]. As tech- 
nology scaling continues, this situation could become 
even worse. Second, traditional redundancy solutions 
such as RAID, which have been widely used for battling 
disk failures, are considered less effective for SSDs, be- 
cause of the high probability of correlated device failures 
in SSD-based RAID [9]. Finally, although some prior 
research work [13, 22, 33] has presented empirical and 
modeling-based studies on the lifespan of flash memo- 
ries and USB flash drives, both positive and negative re- 
sults have been reported. In fact, as a recent report from 
Google® points out, “endurance and retention (of SSDs) 
not yet proven in the field” [10]. 

All these aforesaid issues explain why commercial 
users hesitate to perform a large-scale deployment of 
SSDs in production systems and why integrating SSDs 
into commercial systems is proceeding so “painfully 
slowly” [10]. In order to integrate such a “frustrating 
technology”, which comes with equally outstanding mer- 
its and limits, into the existing storage hierarchy in a 
timely and reliable manner, solutions for effectively im- 
proving the lifespan of SSDs are highly desirable. In this 
paper, we propose such a solution from a unique and viable angle. 


1.1 Background of SSDs 

1.1.1 Flash memory and SSD internals 

NAND flash memory is the basic building block of most 
SSDs on the market. A flash memory package is usu- 
ally composed of one or multiple dies (chips). Each die 
is segmented into multiple planes, and a plane is further 
divided into thousands (e.g. 2048) of erase blocks. An 
erase block usually consists of 64-128 pages. Each page 
has a data area (e.g. 4KB) and a spare area (a.k.a. meta- 
data area). Flash memories support three major opera- 
tions. Read and write (a.k.a. program) are performed in 
units of pages, and erase, which clears all the pages in an 
erase block, must be conducted in erase blocks. 


FAST ’11: 9th USENIX Conference on File and Storage Technologies 




Flash memory has three critical technical constraints: 
(1) No in-place overwrite — the whole erase block must 
be erased before writing (programming) any page in this 
block. (2) No random writes — the pages in an erase block 
must be written sequentially. (3) Limited erase/program 
cycles — an erase block can wear out after a certain num- 
ber of erase/program cycles (typically 10,000-100,000). 
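
The three constraints above can be captured in a toy model of a single erase block. This is a minimal sketch for illustration: the class name, the default of 64 pages per block, and the default limit of 10,000 cycles are assumptions chosen from the ranges mentioned in the text, and real devices differ in many details.

```python
class EraseBlock:
    """Toy model of one erase block; sizes and limits are illustrative."""

    def __init__(self, pages_per_block=64, max_cycles=10_000):
        self.pages = [None] * pages_per_block
        self.next_page = 0          # pages must be programmed in order
        self.erase_count = 0
        self.max_cycles = max_cycles

    def program(self, page_index, data):
        """Write one page; rejects overwrites and out-of-order writes."""
        if page_index != self.next_page or page_index >= len(self.pages):
            raise ValueError("pages must be written sequentially, once each")
        self.pages[page_index] = data
        self.next_page += 1

    def erase(self):
        """Clear every page in the block and count one erase cycle."""
        if self.erase_count >= self.max_cycles:
            raise RuntimeError("block is worn out")
        self.pages = [None] * len(self.pages)
        self.next_page = 0
        self.erase_count += 1


block = EraseBlock(pages_per_block=4, max_cycles=2)
block.program(0, b"a")
block.program(1, b"b")
try:
    block.program(0, b"c")      # in-place overwrite is rejected
    overwrite_ok = True
except ValueError:
    overwrite_ok = False
block.erase()                   # the whole block is cleared before reuse
block.program(0, b"c")
```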

As a critical component in the SSD design, the Flash 
Translation Layer (FTL) is implemented in the SSD con- 
troller to emulate a hard disk drive by exposing an array 
of logical block addresses (LBAs) to the host. In order 
to address the aforesaid three constraints, the FTL de- 
signers have developed several sophisticated techniques: 
(1) Indirect mapping — A mapping table is maintained 
to track the dynamic mapping between logical block ad- 
dresses (LBAs) and physical block addresses (PBAs). 
(2) Log-like write mechanism — Each write to a logical 
page only invalidates the previously occupied physical 
page, and the new content data is appended sequentially 
in a clean erase block, like a log, which is similar to the 
log-structured file system [41]. (3) Garbage collection 
— A garbage collector (GC) is launched periodically to 
recycle invalidated physical pages, consolidate the valid 
pages into a new erase block, and clean the old erase 
block. (4) Wear-leveling — Since writes are often con- 
centrated on a subset of data, which may cause some 
blocks to wear out earlier than the others, a wear-leveling 
mechanism tracks and shuffles hot/cold data to even out 
writes in flash memory. (5) Over-provisioning — In or- 
der to assist garbage collection and wear-leveling, SSD 
manufacturers usually include a certain amount of over- 
provisioned spare flash memory space in addition to the 
host-usable SSD capacity. 
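
The first three techniques can be sketched together in a few lines. The model below is a deliberately simplified, page-level illustration with hypothetical names: it ignores erase-block boundaries, wear-leveling, and over-provisioning, and its garbage collector simply compacts valid pages rather than cleaning individual blocks.

```python
class TinyFTL:
    """Toy page-level FTL: indirect mapping plus log-like writes."""

    def __init__(self, physical_pages):
        self.flash = [None] * physical_pages   # physical page contents
        self.valid = [False] * physical_pages  # False = clean or invalidated
        self.mapping = {}                      # LBA -> PBA
        self.write_ptr = 0                     # next clean page in the log

    def write(self, lba, data):
        """Append data at the log head and invalidate the old copy."""
        if self.write_ptr >= len(self.flash):
            raise RuntimeError("no clean pages; garbage collection needed")
        old = self.mapping.get(lba)
        if old is not None:
            self.valid[old] = False            # no in-place overwrite
        self.flash[self.write_ptr] = data
        self.valid[self.write_ptr] = True
        self.mapping[lba] = self.write_ptr
        self.write_ptr += 1

    def read(self, lba):
        """Translate the LBA through the mapping table and read the page."""
        return self.flash[self.mapping[lba]]

    def garbage_collect(self):
        """Compact valid pages to the front, standing in for block cleaning."""
        live = sorted((pba, lba) for lba, pba in self.mapping.items())
        new_flash = [None] * len(self.flash)
        for new_pba, (old_pba, lba) in enumerate(live):
            new_flash[new_pba] = self.flash[old_pba]
            self.mapping[lba] = new_pba
        self.flash = new_flash
        self.valid = [i < len(live) for i in range(len(self.flash))]
        self.write_ptr = len(live)


ftl = TinyFTL(physical_pages=4)
ftl.write(7, b"v1")
ftl.write(7, b"v2")      # same LBA lands on a new physical page
ftl.write(9, b"x")
ftl.garbage_collect()    # reclaims the invalidated page holding b"v1"
```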


1.1.2 The lifespan of SSDs 


As flash memory has a limited number of erase/program 
cycles, the lifespan of SSDs is naturally constrained. In 
essence, the lifespan of SSDs is a function of three fac- 
tors: (1) The amount of incoming write traffic — The less 
data written into an SSD, the longer the lifespan would 
be. In fact, the SSD manufacturers often advise commer- 
cial users, whose systems undergo intensive write traffic 
(e.g. an email server), to purchase more expensive high- 
end SSDs. (2) The size of over-provisioned flash space 
— A larger over-provisioned flash space provides more 
available clean flash pages in the allocation pool that can 
be used without triggering a garbage collection. Aggres- 
sive over-provisioning can effectively reduce the average 
number of writes over all flash pages, which in turn im- 
proves the endurance of SSDs. For example, the high- 
end Intel® X25-E SSD is aggressively over-provisioned 
with about 8GB flash space, which is 25% of the labeled 
SSD capacity (32GB) [25]. (3) The efficiency of garbage 
collection and wear-leveling mechanisms - Having been 
extensively researched, the garbage collection and wear- 
leveling policies can significantly impact the lifespan of 
SSDs. For example, static wear-leveling, which swaps 
active blocks with randomly chosen inactive blocks, per- 
forms better in endurance than dynamic wear-leveling, 
which only swaps active blocks [13]. 

Most previous research work [21] focuses on the third 
factor, garbage collection and wear-leveling policies. A 
survey [21] summarizes these techniques. In contrast, 
little study has been conducted on the other two aspects. 
This may be because incoming write traffic is normally 
believed to be workload dependent, which cannot be 
changed at the device level, and the over-provisioning 
of flash space is designated at the manufacturing process 
and cannot be excessively large (due to the production 
cost). In this paper we will show that even at the SSD 
device level, we can still effectively extend the SSD lifes- 
pan by reducing the amount of incoming write traffic and 
squeezing available flash memory space during runtime, 
which has not been considered before. This goal can be 
achieved based on our observation of a widely existing 
phenomenon - data duplication. 


1.2 Data Duplication is Common 


In file systems data duplication is very common. For ex- 
ample, kernel developers can have multiple versions of 
Linux source code for different projects. Users can cre- 
ate/delete the same files multiple times. Another exam- 
ple is word editing tools, which often automatically save 
a copy of documents every few minutes, and the content 
of these copies can be almost identical. 

Figure 1: The percentage of redundant data in disks. 

To make a case here, we have studied 15 disks in- 
stalled on 5 machines in the Department of Computer 
Science and Engineering at the Ohio State University. 
Three file systems can be found in these disks, namely 
Ext2, Ext3, and NTFS. The disks are used in different en- 
vironments, 4 disks from Database/Web Servers, 7 disks 
from Experimental Systems for kernel development, and 
the other 4 disks from Office Systems. We slice the disk 
space into 4KB blocks and use the SHA-1 hash func- 
tion [1] to calculate a 160-bit hash value for each block. 
We can identify duplicate blocks by comparing the hash 
values. Figure 1 shows the duplication rates (i.e. the 
percentage of duplicate blocks in total blocks). 
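
The measurement described above can be sketched as follows. This is an illustrative reconstruction, not the tool used in the study: it operates on an in-memory byte string rather than a raw disk, and it assumes that every block whose SHA-1 digest matches an earlier block counts as a duplicate, which is one plausible reading of "duplicate blocks".

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB blocks, as in the study


def duplication_rate(data: bytes) -> float:
    """Fraction of blocks whose SHA-1 digest matches an earlier block."""
    seen = set()
    total = duplicates = 0
    for offset in range(0, len(data), BLOCK_SIZE):
        digest = hashlib.sha1(data[offset:offset + BLOCK_SIZE]).digest()
        total += 1
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / total if total else 0.0


# Synthetic "disk": blocks A, B, A, A, followed by two zero blocks.
image = (b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE * 2
         + bytes(BLOCK_SIZE) * 2)
rate = duplication_rate(image)
```

In this synthetic image, three of the six blocks repeat earlier content, so the rate is 0.5; one of those three is a zero block, which is the kind of duplicate the text distinguishes from 'meaningful' data.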

In Figure 1, we find that the duplication rate ranges 
from 7.9% to 85.9% across the 15 disks. We also find 
that in only one disk with NTFS, the duplicate blocks 
are dominated by ‘zero’ blocks. The duplicate blocks 
on the other disks are mostly non-zero blocks, which 
means that these duplicate blocks contain ‘meaningful’ 
data. Considering the fact that a typical SSD has an over- 
provisioned space of only 1-20% of the flash memory 
space, removing the duplicate data, which accounts for 
7.9-85.9% of the SSD capacity, can substantially extend 
the available flash space that can be used for garbage col- 
lection and wear-leveling. If this effort is successful, we 
can raise performance to a level comparable to that of 
high-end SSDs with no need for extra flash space. 

Figure 2: The percentage of duplicate writes in workloads. 

Besides the static analysis of the data redundancy in 
storage, we have also collected I/O traces and analyzed 
the data accesses of 11 workloads from three categories 
(see more details in Section 4). For each workload, we 
modified the Linux kernel by intercepting each I/O re- 
quest and calculating a hash value for each requested 
block. We analyzed the I/O traces off-line. Figure 2 
shows the percentage of the duplicate writes in each 
workload. We can find that 5.8-28.1% of the writes are 
duplicated. This finding suggests that if we remove these 
duplicate writes, we can effectively reduce the write traf- 
fic into flash medium, which directly improves the en- 
durance accordingly, not to mention the indirect effect of 
reducing the number of extra writes caused by less fre- 
quently triggered garbage collections. 


1.3 Making FTL Content Aware 


Based on the above observations and analysis, we pro- 
pose a Content-Aware Flash Translation Layer (CAFTL) 
to integrate the functionality of eliminating duplicate 
writes and redundant data into SSDs to enhance the lifes- 
pan at the device level. 

CAFTL intercepts incoming write requests at the SSD 
device level and uses a collision-resistant cryptographic hash 
function to generate fingerprints summarizing the con- 
tent of updated data. By querying a fingerprint store, 


which maintains the fingerprints of resident data in the 
SSD, CAFTL can accurately and safely eliminate dupli- 
cate writes to flash medium. CAFTL also uses a two- 
level mapping mechanism to coalesce redundant data, 
which effectively extends available flash space and im- 
proves GC efficiency. In order to minimize the perfor- 
mance impact caused by computing hash values, we have 
also designed a set of acceleration methods to speed up 
fingerprinting. With these techniques, CAFTL can effec- 
tively reduce write traffic to flash, extend available flash 
space, while retaining high data access performance. 
CAFTL is an augmentation, rather than a complete re- 
placement, to the existing FTL designs. Being content- 
aware, CAFTL is orthogonal to the other FTL policies, 
such as the well researched garbage collection and wear- 
leveling policies. In fact, the existing mechanisms in the 
SSDs provide much needed facilities for CAFTL and 
make it a perfect fit in the existing SSD architecture. 
For example, the indirect mapping mechanism naturally 
makes associating multiple logical pages to one physical 
page easy to implement; the periodic scanning process 
for garbage collection and wear-leveling can also carry 
out an out-of-line deduplication asynchronously; the log- 
like write mechanism makes it possible to re-validate the 
‘deleted’ data without re-writing the same content; and 
finally, the semiconductor nature of flash memory makes 
reading randomly remapped data free of high latencies. 
CAFTL is also backward compatible and portable. 
Running at the device level as a part of SSD firmware, 
CAFTL does not need to change the standard host/device 
interface for passing any extra information from the 
upper-level components (e.g. file system) to the device. 
All of the design of CAFTL is isolated at the device level 
and hidden from users. This makes CAFTL a 
drop-in solution, which is highly desirable in practice. 


1.4 Our Contributions 


We have made the following contributions in this paper: 
(1) We have studied data duplications in file systems and 
various workloads, and assessed the viability of improv- 
ing endurance of SSDs through deduplication. (2) We 
have carefully designed a content-aware FTL to extend 
the SSD lifespan by removing duplicate writes (up to 
24.2%) and redundant data (up to 31.2%) with minimal 
overhead. To the best of our knowledge, this is the first 
study using effective deduplication in SSDs. (3) We have 
also designed a set of techniques to accelerate the in-line 
deduplication in SSD devices, which are particularly ef- 
fective with small on-device buffer spaces (e.g. 2MB) 
and make performance overhead nearly negligible. (4) 
We have implemented CAFTL in the DiskSim simula- 
tor and comprehensively evaluated its performance and 
shown the effectiveness of improving the SSD lifespan 
through extensive trace-driven simulations. 

The rest of this paper is organized as follows. In Sec- 
tion 2, we discuss the unique challenges in the design of 
CAFTL. Section 3 introduces the design of CAFTL and 
our acceleration methods. We present our performance 
evaluation in Section 4. Section 5 gives the related work. 
The last section discusses and concludes this paper. 


2 Technical Challenges 


CAFTL shares the same principle of removing data re- 
dundancy with Content-Addressable Storage (CAS), e.g. 
[11,24,30,45,47], which is designed for backup/archival 
systems. However, we cannot simply borrow CAS poli- 
cies in our design due to four unique and unaddressed 
challenges: (1) Limited resources - CAFTL is designed 
for running in an SSD device with limited memory space 
and computing power, rather than running on a dedi- 
cated powerful enterprise server. (2) Relatively lower re- 
dundancy — CAFTL mostly handles regular file system 
workloads, which have an impressive but much lower 
duplication rate than that of backup streams with high 
redundancy (often 10 times or even higher). (3) Lack of 
semantic hints - CAFTL works at the device level and 
only sees a sequence of logical blocks without any se- 
mantic hints from host file systems. (4) Low overhead 
requirement - CAFTL must retain high data access per- 
formance for regular workloads, while this is a less strin- 
gent requirement in backup systems that can run during 
out-of-office hours. 

All of these unique requirements make deduplication 
particularly challenging in SSDs and it requires non- 
trivial efforts to address them in the CAFTL design. 


3 The Design of CAFTL 


The design of CAFTL aims to reach the following three 
critical objectives. 


• Reducing unnecessary write traffic - By examining the 
data of incoming write requests, we can detect and re- 
move duplicate writes in-line, so that we can effec- 
tively filter unnecessary writes into flash memory and 
directly improve the lifespan of SSDs. 


• Extending available flash space - By leveraging the 
indirect mapping framework in SSDs, we can map log- 
ical pages sharing the same content to the same phys- 
ical page. The saved space can be used for GC and 
wear-leveling, which indirectly improves the lifespan. 


• Retaining access performance - A critical requirement 
to make CAFTL truly effective in practice is to avoid 
significant negative performance impacts. We must 
minimize runtime overhead and retain high data ac- 
cess performance. 


3.1 Overview of CAFTL 


CAFTL eliminates duplicate writes and redundant data 
through a combination of both in-line and out-of-line 
(a.k.a post-processing or out-of-band) deduplication. In- 
line deduplication refers to the case where CAFTL 
proactively examines the incoming data and cancels du- 
plicate writes before committing a write request to flash. 
As a ‘best-effort’ solution, CAFTL does not guarantee 
that all duplicate writes can be examined and removed 
immediately (e.g. it can be disabled for performance pur- 
poses). Thus CAFTL also periodically scans the flash 
memory and coalesces redundant data out of line. 


Figure 3: An illustration of CAFTL architecture. 


Figure 3 illustrates how CAFTL handles a write request.
When a write request is received at the SSD, (1) the
incoming data is first held temporarily in the on-device
buffer; (2) a hash engine, which can be a dedicated
processor or simply part of the controller logic, later
computes a hash value, also called a fingerprint, for each
updated page in the buffer; (3) each fingerprint is looked
up in a fingerprint store, which maintains the fingerprints
of data already stored in the flash memory; (4) if a match
is found, meaning that a resident data unit holds the same
content, the mapping tables, which translate host-visible
logical addresses to physical flash addresses, are updated
to map the logical page to the physical location of the
resident data, and the write to flash is canceled; (5)
if no match is found, the write is performed to the flash
memory as a regular write.
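
The five steps above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not the CAFTL implementation: the class name TinyDedupFTL is hypothetical, a Python dict stands in for the fingerprint store and the mapping table, and a list index stands in for a physical flash address.

```python
import hashlib

PAGE_SIZE = 4096  # bytes; the flash page size assumed in the text


class TinyDedupFTL:
    """Toy model of the in-line write path (steps 2-5); names are hypothetical."""

    def __init__(self):
        self.fingerprint_store = {}  # fingerprint -> physical page number
        self.mapping = {}            # logical page number -> physical page number
        self.flash = []              # list index stands in for a physical address
        self.flash_writes = 0

    def write_page(self, lpn, data):
        """Handle one buffered page; return True if the flash write was canceled."""
        fingerprint = hashlib.sha1(data).digest()      # step 2: hash the page
        ppn = self.fingerprint_store.get(fingerprint)  # step 3: look up fingerprint
        if ppn is not None:                            # step 4: remap, cancel write
            self.mapping[lpn] = ppn
            return True
        self.flash.append(data)                        # step 5: regular write
        ppn = len(self.flash) - 1
        self.flash_writes += 1
        self.fingerprint_store[fingerprint] = ppn
        self.mapping[lpn] = ppn
        return False

    def read_page(self, lpn):
        return self.flash[self.mapping[lpn]]
```

Writing the same 4KB content to two logical pages leaves a single physical copy, and both logical pages resolve to it on reads.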


3.2 Hashing and Fingerprint Store 


CAFTL attempts to identify and remove duplicate writes
and redundant data. A byte-by-byte comparison is
excessively slow. A common practice is to use a cryptographic
hash function, e.g. SHA-1 [1] or MD5 [40], to compute a
collision-resistant hash value as a fingerprint. Duplicate
data can then be detected by comparing fingerprints. Here
we explain how we produce and manage fingerprints.


3.2.1 Choosing hashing units


CAFTL uses a chunk-based deduplication approach. Un- 
like most CAS systems, which often use more compli- 
cated variable-sized chunking, CAFTL adopts a fixed- 
sized chunking approach for two reasons. First, the 
variable-sized chunking is designed for segmenting a 
long I/O stream. In CAFTL, we handle a sequence of 
individual requests, whose size can be very small (a few 
kilobytes) and vary significantly. Thus variable-sized 
chunking is inappropriate for CAFTL. Second, the basic 
operation unit in flash is a page (e.g. 4KB), and the inter- 
nal management policies in SSDs, such as the mapping 
policy, are also designed in units of pages. Thus, using 
pages as the fixed-sized chunks for hashing is a natural 
choice and also avoids unnecessary complexity. 


3.2.2 Hash function and fingerprints 


In order to identify duplicate data, a collision-resistant
hash function is used for summarizing the content of pages.
We use SHA-1 [1], a widely used cryptographic hash
function, and rely on its collision-resistant properties to
index and compare pages. For each page, we calculate
a 160-bit hash value as its fingerprint and store it as the
page’s metadata in flash. SHA-1 is designed so that it is
computationally infeasible to find two distinct inputs
hashing to the same value [32]. We therefore treat two
pages with matching fingerprints as identical.


3.2.3 The fingerprint store


Figure 4: The CDF figure of duplicate fingerprints.

In order to quickly locate the physical page with a
specific fingerprint, CAFTL manages an in-memory structure,
called the Fingerprint Store. Keeping all fingerprints and
related information (25 bytes each) in memory is too
costly and unnecessary. We have studied the distribution
of fingerprints in the 15 disks and plot the Cumulative
Distribution Function (CDF) in Figure 4. We can see that
the distribution of duplicated fingerprints is skewed:
only 10-20% of the fingerprints are highly duplicated
(more than 2). This finding has two implications. First,
most fingerprints are unique and never have a chance to
match any queried fingerprint. Second, a complete search
in the fingerprint store would incur high lookup latencies
and, even worse, most lookups would eventually turn out to
be useless (no match found). Thus, we should store and
search only the most likely-to-be-duplicated fingerprints
in memory.


We first logically partition the hash value space into 
N segments. For a given fingerprint, f, we can map it 
to segment (f mod N), and the random nature of the 
hash function guarantees an even distribution of finger- 
prints among the segments. Each segment contains a list 
of buckets. Each bucket is a 4KB page in memory and 
consists of multiple entries, each of which is a key-value 
pair, {fingerprint, (location, reference)}. The 160-bit fin- 
gerprint indexes the entry; the 32-bit location denotes 
where we can find the data, either the PBA of a physi- 
cal flash page or the VBA of a secondary mapping entry 
(see Section 3.3); the 8-bit reference denotes the hotness 
of this fingerprint (i.e. the number of referencing logical 
pages). The entries in each bucket are sorted in the as- 
cending order of their fingerprint values to facilitate a fast 
in-bucket binary search. The total numbers of buckets 
and segments are designated by the SSD manufacturers. 


The fingerprint store maintains the most highly refer- 
enced fingerprints in memory. During the SSD startup 
time, after the mapping tables are built up (to be dis- 
cussed in Section 3.3), the fingerprint store is also recon- 
structed by scanning the mapping tables and the meta- 
data in flash to load the key value pairs of {fingerprint, 
(location, reference)} into memory. Initially no bucket 
is allocated in the fingerprint store. Upon inserting a fin- 
gerprint, an empty bucket is allocated and linked into a 
bucket list of the corresponding segment. This bucket 
holds the fingerprints inserted into the corresponding 
segment until the bucket is filled up, then we allocate 
another bucket. We continue to allocate buckets in this 
way until there are no more free buckets available. If that 
happens, the newly inserted fingerprint will replace the 
fingerprint with the smallest reference counter (i.e. the 
coldest one) in the bucket, unless its reference counter is 
smaller than that of any resident fingerprint. Note that
we choose the inserting bucket in a round-robin manner
to ensure a relatively even distribution of hot/cold
fingerprints across the buckets in a segment. It is also
worth mentioning here that an 8-bit reference counter is
sufficiently large for distinguishing the hot fingerprints,
because most fingerprints have a reference count smaller
than 255 (see Figure 4). We treat fingerprints whose
reference count reaches 255 as highly referenced and do
not further distinguish their hotness. In this way, we can
include the most highly referenced fingerprints in memory.
Although we may miss some opportunities to identify
duplicates whose fingerprints are not resident in memory,
this probability is considered low (as shown in Figure 4),
and we are not pursuing a perfect in-line deduplication.
Our out-of-line scanning can still identify these
duplicates later.
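
The segment/bucket organization and the replacement rule described above can be sketched as follows. This is a minimal illustration, assuming a single pool of free buckets shared by all segments and entries kept as (fingerprint, location, reference) tuples; the class and method names are hypothetical and the sketch does not handle re-inserting a fingerprint that is already resident.

```python
import bisect

ENTRY_BYTES = 25                 # 20 B fingerprint + 4 B location + 1 B reference
BUCKET_BYTES = 4096              # one bucket is a 4KB page in memory
ENTRIES_PER_BUCKET = BUCKET_BYTES // ENTRY_BYTES   # 163 entries per bucket
MAX_REF = 255                    # an 8-bit reference counter saturates here


class FingerprintStore:
    def __init__(self, num_segments, max_buckets,
                 entries_per_bucket=ENTRIES_PER_BUCKET):
        self.num_segments = num_segments
        self.free_buckets = max_buckets          # assumed: one shared pool
        self.capacity = entries_per_bucket
        self.segments = [[] for _ in range(num_segments)]
        self.cursor = [0] * num_segments         # round-robin victim bucket

    def _segment(self, fp):
        return int.from_bytes(fp, "big") % self.num_segments   # f mod N

    def lookup(self, fp):
        """Scan the segment's buckets; binary search inside each sorted bucket."""
        for bucket in self.segments[self._segment(fp)]:
            i = bisect.bisect_left(bucket, (fp,))
            if i < len(bucket) and bucket[i][0] == fp:
                return bucket[i][1], bucket[i][2]
        return None

    def insert(self, fp, location, ref=1):
        """Return True if the fingerprint became resident in memory."""
        entry = (fp, location, min(ref, MAX_REF))
        seg_no = self._segment(fp)
        buckets = self.segments[seg_no]
        if buckets and len(buckets[-1]) < self.capacity:
            bisect.insort(buckets[-1], entry)    # keep the bucket sorted
            return True
        if self.free_buckets > 0:                # allocate and link a new bucket
            self.free_buckets -= 1
            buckets.append([entry])
            return True
        if not buckets:
            return False                         # nothing to replace here
        victim = buckets[self.cursor[seg_no] % len(buckets)]
        self.cursor[seg_no] += 1
        coldest = min(victim, key=lambda e: e[2])
        if entry[2] < coldest[2]:
            return False                         # newcomer is colder than all
        victim.remove(coldest)
        bisect.insort(victim, entry)
        return True
```

With a 25-byte entry, a 4KB bucket holds 163 entries, so the number of buckets directly bounds how many hot fingerprints can be resident.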


Searching for a fingerprint is simple. We compute the
mapping segment number and scan the corresponding list of
buckets one by one. In each bucket, we use binary search
to speed up the in-bucket lookup. However, for a segment
with a large set of buckets, this method can still be
improved. We have designed three optimization techniques
to further accelerate fingerprint lookups. (1) Range Check:
before performing the binary search in a bucket, we first
compare the fingerprint with the smallest and the largest
fingerprints in the bucket. If the fingerprint is out of
the range, we skip over this bucket. (2) Hotness-based
Reorganization: the fingerprints in the linked buckets can
be reorganized in the descending order of their reference
counters. This moves the hot fingerprints closer to the
list head and potentially reduces the number of scanned
buckets. (3) Bucket-level Binary Search: the fingerprints
across the buckets can be reorganized in the ascending
order of the fingerprint values by using a merge sort. For
each segment we maintain an array of pointers to the
buckets in the list. We can perform a binary search at the
bucket level by recursively selecting the bucket in the
middle to do a Range Check. In this way we can quickly
locate the target bucket and skip over most buckets.
Although reorganizing the fingerprints requires an
additional merge sort, our experiments show that these
optimizations can significantly reduce the number of
comparisons of fingerprint values. In Section 4.3.3 we
show and compare the effectiveness of the three techniques.
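
Range Check and Bucket-level Binary Search can be sketched together. The following is an illustrative sketch only: buckets are modeled as plain sorted lists of fingerprint values, Python's built-in sort stands in for the merge sort, the search is written iteratively rather than recursively, and the function names are hypothetical.

```python
import bisect


def range_check(bucket, fp):
    """Return -1, 0, or +1 if fp lies below, within, or above the bucket's range."""
    if fp < bucket[0]:
        return -1
    if fp > bucket[-1]:
        return 1
    return 0


def reorganize(buckets, capacity):
    """Redistribute fingerprints so that bucket i holds smaller values than bucket i+1."""
    merged = sorted(fp for bucket in buckets for fp in bucket)
    return [merged[i:i + capacity] for i in range(0, len(merged), capacity)]


def bucket_level_search(buckets, fp):
    """Binary search over buckets via Range Check, then inside the target bucket."""
    lo, hi = 0, len(buckets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        side = range_check(buckets[mid], fp)
        if side < 0:
            hi = mid - 1
        elif side > 0:
            lo = mid + 1
        else:
            i = bisect.bisect_left(buckets[mid], fp)
            return (mid, i) if buckets[mid][i] == fp else None
    return None
```

After reorganization, a lookup touches roughly log2 of the bucket count instead of scanning the whole bucket list, which is why the reorganization cost can pay off for long lists.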


3.3 Indirect Mapping 


Indirect mapping is a core mechanism in the SSD archi- 
tecture. SSDs expose an array of logical block addresses 
(LBAs) to the host, and internally, a mapping table is 
maintained to track the physical block address (PBA) to 
which each LBA is mapped. For CAFTL, the existing 
indirect mapping mechanism in SSDs provides a basic 
framework for deduplication and avoids rebuilding the 
whole infrastructure from scratch. 

On the other hand, the existing 1-to-1 mapping mecha- 
nism in SSDs cannot be directly used for CAFTL, which 
is essentially N-to-1 mapping, because of two new chal- 
lenges. (1) When a physical page is relocated to an- 
other place (e.g. in garbage collection), we must be able 
to identify quickly all the logical pages mapped to this 
physical page and update their mapping entries to point 
to the new location. (2) Since a physical page could be 
shared by multiple logical pages, it cannot be recycled 
by the garbage collector until all the referencing logical 
pages are demapped from it, which means that we must 
track the number of referencing logical pages.

Figure 5: An illustration of the indirect mapping.


We have designed a new indirect mapping mechanism
to address these issues. As shown in Figure 5, a
conventional FTL uses a one-level indirect mapping, from
LBAs to PBAs. In CAFTL, we create another indirect
mapping level, called Virtual Block Addresses (VBAs).
A VBA is essentially a pseudo address name that
represents a set of LBAs mapped to the same PBA.
In this two-level indirect mapping structure, we can
locate the physical page for a logical page either through
LBA→PBA or LBA→VBA→PBA.


We maintain a primary mapping table and a secondary
mapping table in memory. The primary mapping table
maps an LBA to either a PBA, if the logical page is unique,
or a VBA, if it is a duplicate page. We differentiate PBAs
and VBAs by using the most significant bit in the 32-bit
page address. For a page size of 4KB, the remaining 31
bits can address 8,192 GB of storage space, which is
sufficiently large for an SSD. The secondary mapping table
maps a VBA to a PBA. Each entry is indexed by the
VBA and has two fields, {PBA, reference}. The 32-bit
PBA denotes the physical flash page, and the 32-bit
reference tracks the exact number of logical pages mapped
to the physical page. Only physical pages without any
reference can be recycled by garbage collection.


This two-level indirect mapping mechanism has sev- 
eral merits. First, it significantly simplifies the reverse 
updates to the mapping of duplicate logical pages. When 
relocating a physical page during GC, we can use its 
associated VBA to quickly locate and update the sec- 
ondary mapping table by mapping the VBA to the new 
location (PBA), which avoids exhaustively searching for 
all the referencing LBAs in the huge primary mapping 
table. Second, the secondary mapping table can be 
very small. Since CAFTL handles regular file system 
workloads, most logical pages are unique and directly 
mapped through the primary table. We can also ap- 
ply an approach similar to DFTL [23] to further re- 
duce the memory demand by selectively maintaining the 
most frequently accessed entries of the mapping tables in 
memory. Finally, this mechanism incurs minimal additional
lookup overhead. For unique pages, it performs identically
to conventional FTLs; for duplicate pages, only one extra
memory access is needed for the lookup operation.
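
A minimal sketch of the two-level mapping follows, assuming Python dicts in place of the in-memory tables and a simple counter for allocating VBAs; the class and method names are hypothetical. It also checks the address-space arithmetic: 2^31 pages of 4KB each give 8,192 GB.

```python
VBA_FLAG = 1 << 31            # most significant bit marks the entry as a VBA
ADDR_MASK = VBA_FLAG - 1      # remaining 31 bits hold the address


class TwoLevelMap:
    def __init__(self):
        self.primary = {}     # LBA -> PBA, or (VBA | VBA_FLAG) for shared pages
        self.secondary = {}   # VBA -> [PBA, reference count]
        self.next_vba = 0     # assumed allocation scheme: a running counter

    def map_unique(self, lba, pba):
        self.primary[lba] = pba

    def share(self, existing_lba, new_lba):
        """Map new_lba to the physical page of existing_lba; return the VBA."""
        entry = self.primary[existing_lba]
        if not entry & VBA_FLAG:                 # promote unique page to shared
            vba = self.next_vba
            self.next_vba += 1
            self.secondary[vba] = [entry, 1]
            entry = vba | VBA_FLAG
            self.primary[existing_lba] = entry
        vba = entry & ADDR_MASK
        self.secondary[vba][1] += 1
        self.primary[new_lba] = entry
        return vba

    def resolve(self, lba):
        """LBA -> PBA directly, or LBA -> VBA -> PBA with one extra lookup."""
        entry = self.primary[lba]
        if entry & VBA_FLAG:
            return self.secondary[entry & ADDR_MASK][0]
        return entry

    def relocate_vba(self, vba, new_pba):
        """GC moved a shared page: one update redirects every referencing LBA."""
        self.secondary[vba][0] = new_pba


addressable_gb = (2 ** 31 * 4096) // 2 ** 30     # 31 address bits, 4KB pages
```

The point of the indirection is visible in relocate_vba: the primary table is never searched when a shared physical page moves.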


3.3.2 The mapping tables in flash


The mapping relationship is also maintained in flash 
memory. We keep an in-flash copy of the primary and 
secondary mapping tables along with a journal in
dedicated flash space in the SSD. Both in-flash structures
are organized as a list of linked physical flash pages. When
updating the in-memory tables (e.g. remapping an LBA
to a new location), the update record is logged into a
small in-memory buffer. When the buffer is filled, the 
log records are appended to the in-flash journal. If power 
failure happens, a capacitor (e.g. a SuperCap [46]) can 
provide sufficient current to flush the unwritten logs into 
the journal and secure the critical mapping structures in 
persistent storage. Periodically the in-memory tables are 
synced into flash and the journal is reinitialized. Dur- 
ing the startup time, the in-flash tables are first loaded 
into memory and the logged updates in the journal are 
applied to reconstruct the mapping tables. 
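
The logging, checkpoint, and recovery cycle described above can be sketched as follows. This is an illustrative model only: Python containers stand in for the in-flash table copy and journal, the buffer size is arbitrary, and the class and method names are hypothetical.

```python
class JournaledTable:
    def __init__(self, log_buffer_size=4):
        self.table = {}                 # in-memory mapping table
        self.log_buffer = []            # small in-memory buffer of update records
        self.log_buffer_size = log_buffer_size
        self.flash_table = {}           # in-flash copy from the last sync
        self.flash_journal = []         # in-flash journal of later updates

    def update(self, lba, addr):
        self.table[lba] = addr
        self.log_buffer.append((lba, addr))
        if len(self.log_buffer) >= self.log_buffer_size:
            self.flush_log()            # buffer full: append to the journal

    def flush_log(self):
        """Also what the capacitor-backed flush would do on power failure."""
        self.flash_journal.extend(self.log_buffer)
        self.log_buffer.clear()

    def checkpoint(self):
        """Periodic sync: write the tables to flash and reinitialize the journal."""
        self.flush_log()
        self.flash_table = dict(self.table)
        self.flash_journal.clear()

    def recover(self):
        """Startup: load the in-flash table, then replay the journal in order."""
        table = dict(self.flash_table)
        for lba, addr in self.flash_journal:
            table[lba] = addr
        return table
```

Replaying the journal in append order matters: a later record for the same LBA must override an earlier one.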


3.3.3 The metadata pages in flash


Unlike much prior work, which writes the metadata (e.g. 
LBA and fingerprint) in the spare area of physical flash 
pages, we reserve a dedicated number of flash pages, also 
called metadata pages, to store the metadata, and keep a 
metadata page array for tracking PBAs of the metadata 
pages. The spare area of a physical page is only used for 
storing the Error Correction Code (ECC) checksum. If 
each physical page is associated with 24 bytes of meta- 
data (a 160-bit fingerprint and a 32-bit LBA/VBA), for a 
32GB SSD with 4KB flash pages, we need about 0.6% of 
the flash space for storing metadata and a 192KB meta- 
data page array. In this way, we can detach the data pages 
and the metadata pages, which allows us to manage flex- 
ibly the metadata for physical flash pages. 
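
The space figures quoted above can be checked with a few lines of arithmetic. The sketch assumes binary units (32GB = 32 x 2^30 bytes) and one 4-byte PBA per entry of the metadata page array; both are assumptions consistent with the numbers in the text rather than details stated there.

```python
PAGE_BYTES = 4 * 1024            # 4KB flash page
SSD_BYTES = 32 * 1024 ** 3       # 32GB SSD (binary units assumed)
META_PER_PAGE = 24               # 20 B fingerprint + 4 B LBA/VBA
PBA_BYTES = 4                    # assumed size of one metadata page array entry

data_pages = SSD_BYTES // PAGE_BYTES              # physical pages to describe
metadata_bytes = data_pages * META_PER_PAGE       # total metadata
metadata_pages = metadata_bytes // PAGE_BYTES     # flash pages holding metadata
overhead = metadata_bytes / SSD_BYTES             # fraction of flash space
array_kb = metadata_pages * PBA_BYTES // 1024     # metadata page array size
```

The overhead is simply 24/4096, about 0.59%, independent of the SSD capacity; only the size of the metadata page array grows with capacity.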


3.4 Acceleration Methods 


Fingerprinting is the key bottleneck of the in-line dedu- 
plication in CAFTL, especially when the on-device 
buffer size is limited. Here we present three effective 
techniques to reduce its negative performance impact. 


3.4.1 Sampling for hashing 


In file system workloads, as we discussed previously,
duplicate writes are not a ‘common case’ as in backup
systems. This means that most of the time we spend on
fingerprinting is not useful at all. Thus, we selectively
pick only one page as a sample page for fingerprinting, and
we use this sample fingerprint to query the fingerprint
store to see if we can find a match there. If so, the whole
write request is very likely to be a duplicate, and we can
further compute fingerprints for the other pages to confirm
that. Otherwise, we assume the whole request is not a
duplicate and abort fingerprinting at the earliest time. In
this way, we can significantly reduce the hashing cost.

The key issue here is which page should be chosen as
the sample page. It is particularly challenging in CAFTL,
since CAFTL only sees a sequence of blocks and cannot
leverage any file-level semantic hints (e.g. [11]). We
propose to use Content-based Sampling: we select the first
four bytes, called sample bytes, from each page in a
request, and we concatenate the four bytes into a 32-bit
numeric value. We compare these values, and the page
with the largest value is the sample page. The rationale
behind this is that if two requests carry similar content,
the pages with the largest sample bytes in the two requests
are very likely to be the same, too. We deliberately
avoid selecting the sample pages based on hash values
(e.g. [11,30]), because in CAFTL, hashing itself incurs
high latency. We therefore pick sample pages directly
based on their unprocessed content data. We have also
examined choosing other bytes (Figure 6) as the sample
bytes and found that using the first four bytes performs
consistently well across different workloads.

Figure 6: An illustration of four choices of sample bytes.


In our implementation of sampling, we divide the
sequence of pages in a write request into several sampling
units (e.g. 32 pages), and we pick one sample page from
each unit. We also note that sampling could affect
deduplication: the larger a sampling unit is, the better
the performance but the lower the deduplication rate. We
will study the effect of unit sizes in Section 4.4.1.
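
Content-based Sampling with sampling units can be sketched in a few lines. The byte order used to turn the four sample bytes into a number is an assumption (big-endian here), ties are broken in favor of the earliest page, and the function names are hypothetical.

```python
def pick_sample_page(pages):
    """Index of the page whose first four bytes form the largest 32-bit value."""
    return max(range(len(pages)),
               key=lambda i: int.from_bytes(pages[i][:4], "big"))


def sample_indices(pages, unit=32):
    """Split a request into sampling units and pick one sample page per unit."""
    return [start + pick_sample_page(pages[start:start + unit])
            for start in range(0, len(pages), unit)]
```

Only the pages returned by sample_indices are fingerprinted up front; the remaining pages of a unit are hashed only if the sample fingerprint finds a match. Because the choice depends on raw content alone, two requests with the same pages select the same sample page without any hashing.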


3.4.2 Light-weight pre-hashing


Computing a light-weight hash function often incurs
lower computational cost. For example, producing a
32-bit CRC32 hash value is over 10 times faster than
computing a 160-bit SHA-1 hash value. More importantly,
our study shows that reducing the hash strength would
not incur a significant increase of false positives for a
typical SSD capacity. We can see in Figure 7 that using
only 32 bits can achieve nearly the same condense
rate as using 160 bits. In addition, many SSDs integrate a
dedicated ECC engine to compute checksums and detect
errors, which can also be leveraged to speed up hashing.

FAST '11: 9th USENIX Conference on File and Storage Technologies

We propose a technique, called light-weight pre- 
hashing. We maintain an extra 32-bit CRC32 hash value 
for each fingerprint in the fingerprint store. For a page, 
we first compute a CRC32 hash value and query the fin- 
gerprint store. If a match is found, which means the page 
is very likely to be a duplicate, then we use the SHA- 
1 hash function to generate a fingerprint and confirm it 
in the fingerprint store; otherwise, we abort the high- 
cost SHA-1 fingerprinting immediately and perform the 
write to flash. Although maintaining CRC32 hash val- 
ues demands more fingerprint store space, the significant 
performance benefit well justifies it, as shown in Sec- 
tion 4.4.2. We have also considered using a Bloom fil- 
ter [12] for pre-screening, like in the DataDomain® file 
system [47], but found it inapplicable to CAFTL, be- 
cause it requires multiple hashings and the summary vec- 
tor cannot be updated when a fingerprint is removed. 
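
The pre-hashing filter can be sketched with Python's standard zlib.crc32 and hashlib.sha1. This is a simplified illustration: a dict keyed by CRC32 value stands in for the extra CRC32 field kept with each fingerprint, and the class name and sha1_calls counter are hypothetical additions for demonstration.

```python
import hashlib
import zlib


class PreHashStore:
    def __init__(self):
        self.by_crc = {}       # CRC32 value -> set of SHA-1 fingerprints
        self.sha1_calls = 0    # counts SHA-1 computations on the lookup path

    def add(self, page):
        """Record a stored page under both its CRC32 value and its fingerprint."""
        fingerprint = hashlib.sha1(page).digest()
        self.by_crc.setdefault(zlib.crc32(page), set()).add(fingerprint)

    def is_duplicate(self, page):
        candidates = self.by_crc.get(zlib.crc32(page))
        if not candidates:
            return False       # CRC32 miss: skip SHA-1, write the page to flash
        self.sha1_calls += 1   # CRC32 hit: confirm with the full fingerprint
        return hashlib.sha1(page).digest() in candidates
```

A CRC32 match alone is never trusted as proof of duplication; it only decides whether the expensive SHA-1 computation is worth doing.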


3.4.3 Dynamic switches


In some extreme cases, incoming requests may wait for
available buffer space to be released by previous
requests. CAFTL provides a dynamic switch as the last line
of defense for performance protection in such cases.

We set a high watermark and a low watermark to turn
the in-line deduplication off and on, respectively. If the
percentage of the occupied cache space hits the high
watermark (95%), we disable the in-line deduplication to
flush writes quickly to flash and release buffer space. Once
the low watermark (50%) is hit, we re-enable the in-line
deduplication. Although this remedy reduces the
deduplication rate, we can still perform out-of-line
deduplication at a later time, so it is an acceptable
tradeoff for retaining high performance.
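
The two watermarks form a simple hysteresis switch, sketched below. The 95% and 50% thresholds are taken from the text; the class name and the observe method are hypothetical.

```python
class DedupSwitch:
    def __init__(self, high=0.95, low=0.50):
        self.high = high
        self.low = low
        self.inline_enabled = True

    def observe(self, buffer_occupancy):
        """Update the switch from the current buffer occupancy (0.0 to 1.0)."""
        if self.inline_enabled and buffer_occupancy >= self.high:
            self.inline_enabled = False      # buffer nearly full: stop hashing
        elif not self.inline_enabled and buffer_occupancy <= self.low:
            self.inline_enabled = True       # buffer drained: resume dedup
        return self.inline_enabled
```

Using two thresholds instead of one keeps the switch from flapping when occupancy hovers around a single cutoff.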


3.5 Out-of-line Deduplication


As mentioned previously, CAFTL does not pursue a per- 
fect in-line deduplication, and an internal routine is pe- 
riodically launched to perform out-of-line fingerprinting 
and out-of-line deduplication during the device idle time. 

Out-of-line fingerprinting is simple. We scan the
metadata page array (Section 3.3.3) to find physical
pages not yet fingerprinted. If such a page is found,
we read the page out, compute the fingerprint, and
update its metadata. To avoid unnecessarily scanning the
metadata of pages already fingerprinted, we use one bit
in each entry of the metadata page array to denote whether
all of the fingerprints in the corresponding metadata page
have already been computed, and we skip over such pages.

Out-of-line deduplication is more complicated due to
the memory space constraint. We adopt a solution similar
to the external merge sort [39] widely used in database
systems. Suppose we have M fingerprints in total and
the available memory space can accommodate N
fingerprints, where M > N. We scan the metadata page array
from the beginning; each time, N fingerprints are loaded
and sorted in memory and then temporarily stored in flash,
then we load and sort the next N fingerprints, and so on.
This process is repeated K times ($K = \lceil M/N \rceil$)
until all the fingerprints are processed. Then we can merge
sort these K blocks of fingerprints in memory and identify
the duplicate fingerprints.
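
The run-then-merge procedure can be sketched with heapq.merge, which performs a streaming K-way merge of sorted inputs. This is only an illustration: Python lists stand in for the sorted runs temporarily stored in flash, and the function name and return shape are hypothetical.

```python
import heapq
from itertools import groupby


def find_duplicates(entries, memory_capacity):
    """entries: list of (fingerprint, physical_page) pairs, M in total.

    memory_capacity is N, the number of fingerprints that fit in memory.
    Returns ({fingerprint: [pages]} for fingerprints on 2+ pages, K).
    """
    runs = []
    for start in range(0, len(entries), memory_capacity):
        # Load N fingerprints, sort them in memory, store the run "in flash".
        runs.append(sorted(entries[start:start + memory_capacity]))

    duplicates = {}
    # K-way merge: equal fingerprints become adjacent in the merged stream.
    merged = heapq.merge(*runs)
    for fingerprint, group in groupby(merged, key=lambda e: e[0]):
        pages = [page for _, page in group]
        if len(pages) > 1:
            duplicates[fingerprint] = pages
    return duplicates, len(runs)
```

The merge only needs the head of each run in memory at any time, which is what makes the approach workable when M is much larger than N.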

Out-of-line fingerprinting and deduplication can be
performed together with the GC process or independently.
Since there is no harm in leaving duplicate or
unfingerprinted pages in flash, these operations can be
performed during idle periods and immediately aborted upon
incoming requests, so the perceivable performance impact
on foreground jobs is minimal.


4 Performance Evaluation 
4.1 Experimental Systems 


We have implemented and evaluated our design of 
CAFTL based on a comprehensive trace-driven simula- 
tion. In this section we will introduce our simulator, trace 
collection, and system configurations. 


4.1.1 SSD Simulator 


CAFTL is a device-level design running in the SSD con- 
troller. We have implemented it in a sophisticated SSD 
simulator based on the Microsoft® Research SSD exten- 
sion [5] for the DiskSim simulation environment [14]. 
This extension was also used in prior work [6]. 

The Microsoft extension is well modularized and im- 
plements the major components of FTL, such as the indi- 
rect mapping, garbage collection and wear-leveling poli- 
cies, and others. Since the current version lacks an on- 
device buffer, which is becoming a standard component 
in recent generations of SSDs, we augmented the current 
implementation and included a shared buffer for han- 
dling incoming read and write requests. When a write 
request is received at the SSD, it is first buffered in the 
cache, and the SSD immediately reports completion to 
the host. Data processing and flash operations are
conducted asynchronously in the background [16]. A read
request returns to the host once the data is loaded
from flash into the buffer. We should note that this
simulator follows a general FTL design [6], and actual
SSD implementations on the market can have other
specific features.


4.1.2 SSD Configurations

Description | Configuration 
Flash Page Size | 4KB 
Pages per Block | 64 
Blocks per Plane | 2048 
Planes per Package | 8 
# of Packages | 10 
Mapping policy | Full striping 
Over-provisioning | 15% 
Garbage Collection Threshold | 5% 


Table 1: Configurations of the SSD simulator. 


In our experiments, we use the default configurations
from the SSD extension, unless denoted otherwise. Table
1 gives a list of the major configuration parameters.


Description | Latency
Flash Read/Write/Erase | 25µs/200µs/1.5ms
SHA-1 hashing (4KB) | 47,548 cycles
CRC32 hashing (4KB) | 4,120 cycles


Table 2: Latencies configured in the SSD simulator. 


Table 2 gives the parameters of latencies used in our 
experiments. For the flash memory, we use the default la- 
tencies in our experiments. For the hashing latencies, we 
first cross compile the hash function code to the ARM® 
platform and run it on the SimpleScalar-ARM simula- 
tor [4] to extract the total number of cycles for executing 
a hash function. We assume a processor similar to ARM® 
Cortex R4 [8] on the device, which is specifically de- 
signed for high-performance embedded devices, includ- 
ing storage. Based on its datasheet, the ARM processor 
has a frequency from 304MHz to 934MHz [8], and we 
can estimate the latency for hashing a 4KB page by
dividing the number of cycles by the processor frequency.
It is also worth mentioning here that, according to our
communications with an SSD manufacturer [3], high-frequency
(600+ MHz) processors, such as the Cortex processor,
are becoming increasingly common in high-speed storage
devices. Leveraging such abundant computing power on
storage devices can be a research topic for further
investigation.
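
As a worked example of this estimate, the cycle counts in Table 2 can be converted to per-page latencies at the two ends of the frequency range. The helper name latency_us is hypothetical; dividing cycles by a frequency in MHz yields microseconds.

```python
SHA1_CYCLES = 47_548     # cycles to hash one 4KB page with SHA-1 (Table 2)
CRC32_CYCLES = 4_120     # cycles to hash one 4KB page with CRC32 (Table 2)


def latency_us(cycles, freq_mhz):
    """Hashing latency in microseconds: cycles / (cycles per microsecond)."""
    return cycles / freq_mhz


sha1_fast = latency_us(SHA1_CYCLES, 934)     # about 50.9 us at 934MHz
sha1_slow = latency_us(SHA1_CYCLES, 304)     # about 156.4 us at 304MHz
crc32_fast = latency_us(CRC32_CYCLES, 934)   # about 4.4 us at 934MHz
crc32_slow = latency_us(CRC32_CYCLES, 304)   # about 13.6 us at 304MHz
speed_ratio = SHA1_CYCLES / CRC32_CYCLES     # about 11.5
```

Under these numbers, a SHA-1 hash of one page at 934MHz costs roughly two flash page reads (25µs each) and about a quarter of a flash page write (200µs).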


4.1.3 Workloads and trace collection


We have selected 11 workloads from three representative 
categories and collected their data access traces. 


• Desktop (d1, d2): Typical office workloads, e.g.
Internet surfing, emailing, word editing, etc. The
workloads run for 12 and 19 hours, respectively, and feature
irregular idle intervals and small reads and writes.


• Hadoop (h1-h7): We execute seven TPC-H data warehouse
queries (Query 1, 6, 11, 14, 15, 16, 20) with a scale
factor of 1 on a Hadoop distributed system platform
[2]. These workloads run for 2-40 minutes and generate
intensive large writes of temp data.


• Transaction (t1, t2): We execute TPC-C workloads
(1-3 warehouses, 10 terminals) for transaction processing
on the PostgreSQL 8.4.3 database system. The two
workloads run for 30 minutes and 4 hours, respectively, and
feature intensive write operations.


The traces are collected on a DELL® Dimension 3100 
workstation with an Intel® Pentium™4 3.0GHz proces- 
sor, a 3GB main memory, and a 160GB 7,200 RPM 
Seagate® hard disk drive. We use Ubuntu 9.10 with the 
Ext3 file system. We modified the Linux kernel 2.6.32 
source code to intercept each I/O request and compute a 
SHA-1 hash value as a fingerprint for each 4KB page of 
the request. These fingerprints, together with other re- 
quest information (e.g. offset, type), are transferred to 
another machine via netconsole [35]. This avoids the 
possible interference caused by tracing. The collected 
trace files are analyzed offline and used to drive the sim- 
ulator for our experimental evaluation. 


4.2 Effectiveness of Deduplication 


CAFTL intends to remove duplicate writes and extend 
flash space. In this section, we perform two sets of ex- 
periments to show the effectiveness of deduplication in 
CAFTL. In both experiments, we use an SSD with a 
934MHz processor and a 16MB buffer. 


4.2.1 Removing duplicate writes 


CAFTL identifies and removes duplicate writes via
in-line deduplication. Denoting the total number of pages
requested to be written as n, and the total number of
pages actually written to the flash medium as m, the
deduplication rate is defined as $(n - m)/n$. Figure 8
shows the deduplication rate of the 11 workloads running on
CAFTL. In this figure, offline refers to the optimal case,
where the traces are examined and deduplicated offline.
We also show CAFTL without sampling and with a
sampling unit size of 128KB (32 pages), denoted as
no-sampling and 128KB, respectively.

Figure 8: Perc. of removed duplicate writes.

As we see in Figure 8, duplication is highly workload
dependent. Across the 11 workloads, the rate of
duplicate writes ranges from 5.8% (t1) to 28.1% (h6).
CAFTL achieves deduplication rates from 4.6% (t1) to
24.2% (h6) with no sampling. Compared with the optimal
case (offline), CAFTL identifies up to 86.2% of the
duplicate writes found offline. We can also see that with
a larger sampling unit (128KB), CAFTL achieves a lower
but reasonable deduplication rate. In Section 4.4.1, we
give a more detailed analysis of the effect of sampling
unit sizes.


4.2.2 Extending flash space 


[Figure 9 plot: bar chart of Perc. of Saved Flash Space (%)
for workloads d1, d2, h1-h7, t1, t2 (d: desktop; h: hadoop;
t: transaction), with series no-sampling and 128KB.]

Figure 9: Perc. of extended flash space.

Besides directly removing duplicate writes to the flash
memory, CAFTL also reduces the amount of occupied
flash memory space and increases the number of available
clean erase blocks for garbage collection and
wear-leveling. Figure 9 shows the percentage of extended
flash space in units of erase blocks, compared to the
baseline case (without CAFTL). We show CAFTL without
sampling (no-sampling) and with sampling (128KB).

As shown in Figure 9, CAFTL can save up to 31.2%
(h1) of the occupied flash blocks for the 11 workloads.
The worst cases are h2 and h5, in which no space saving
is observed. This is because the two workloads are
relatively small: the total number of occupied erase blocks
is only 176. Although the number of pages written
is reduced by 16.6% (h2) and 15% (h5), the saved space
in units of erase blocks is very small.


4.3 Performance Impact 


To make CAFTL truly effective in practice, we must
retain high performance and minimize negative impact.
Here we study three key factors affecting performance:
cache size, hashing speed, and fingerprint searching.
The acceleration methods are not applied in these
experiments.


4.3.1 Cache size 


In Figure 10, we show the percentage increase of
average read/write latencies with various cache sizes
(2MB to 16MB). We compare CAFTL with the baseline
case (without CAFTL). In the experiments, we configure
an SSD with a 934MHz processor. We can see that
with a small cache space (2MB), the read and write
latencies can increase by up to 34% (t1). With a
moderate cache size (8MB), the latency increases are
reduced to less than 4.5%. With a 16MB cache, a rather
standard size, the latency increases become negligible
(less than 0.5%). For some workloads (d2, h3, h5, h7,
t1, t2), we can even see a slight performance improvement
(0.2-0.5%), because CAFTL removes unnecessary
writes, which reduces the probability of being blocked
by an in-progress flash write operation. Since a small
cache space does incur a negative performance impact, we
will show how to mitigate this problem through our
acceleration methods in Section 4.4.


4.3.2 Hashing speed 


Computing fingerprints is time consuming and affects ac- 
cess performance. The hashing speed depends on the ca- 
pability of processors. Using a more powerful processor 
can effectively reduce the latency for digesting pages and 
generating fingerprints. To study the performance im- 
pact caused by hashing speed, we vary the processor fre- 
quency from 304MHz to 934MHz, based on the Cortex 
datasheet [8]. We configure an SSD with a 16MB cache 
space and show the increase of read latencies compared 
to the baseline case (without CAFTL) in Figure 11. We 
did not observe an increase of write latencies, since most 
writes are absorbed in the buffer.

Figure 11: Perf. impact of hashing speeds.


In Figure 11, we can see that most workloads are 
insensitive to hashing speed. With a 304MHz proces- 
sor, the performance overhead is less than 8.5% (#2), 
which has more intensive larger writes. At 934MHz, 
the performance overhead is merely observable (up to 
0.5%). There are two reasons. First, the 16MB on-device 
buffer absorbs most incoming writes and provides
sufficient space for accommodating incoming reads. Second,
the incoming read requests are given a higher priority
than writes, which reduces delays in the critical path.
These optimizations make reads insensitive to hashing
speed and reduce noticeable latencies. Also note that if a
dedicated hashing engine is used on the device, the
hashing latency could be further reduced.

Figure 10: Performance impact of cache sizes (2-16MB).


4.3.3 Fingerprint searching 
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Figure 12: Optimizations on fingerprint searching. 

We have proposed three techniques to accelerate
fingerprint searching. Figure 12 shows the percentage
of reduced fingerprint comparisons compared with the
baseline case. We configure the fingerprint store with
256 segments to hold the fingerprints for each workload.
We can see that using Range Check can effectively
reduce the comparisons of fingerprints by up to
23.7% (12). However, Hotness-based Reorganization
provides little further improvement (less than 1%),
because it essentially accelerates lookups for fingerprints
that are duplicated, which is a relatively uncommon
case. As expected, Bucket-level Binary Search can
significantly reduce the average number of comparisons
for each lookup. In d2, for example, Bucket-level
Binary Search reduces the average number of
comparisons by 85.5%. Thus we suggest applying
Bucket-level Binary Search and Range Check to speed
up fingerprint lookups.
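
As a rough illustration of why Range Check and Bucket-level Binary Search cut comparisons, the sketch below models a segment as a list of buckets holding sorted fingerprints. The Bucket class, its layout, and the integer fingerprints are assumptions made for illustration only; they are not CAFTL's actual data structures.

```python
from bisect import bisect_left


class Bucket:
    """Hypothetical bucket: fingerprints are kept in sorted order."""

    def __init__(self, fingerprints):
        self.fps = sorted(fingerprints)

    def in_range(self, fp):
        # Range Check: two comparisons decide whether fp can be in here.
        return bool(self.fps) and self.fps[0] <= fp <= self.fps[-1]

    def contains(self, fp):
        # Bucket-level Binary Search: O(log n) comparisons, not a scan.
        i = bisect_left(self.fps, fp)
        return i < len(self.fps) and self.fps[i] == fp


def lookup(buckets, fp):
    """Return True if fp is stored in any bucket of the segment."""
    for bucket in buckets:
        if not bucket.in_range(fp):
            continue  # bucket skipped without searching its contents
        if bucket.contains(fp):
            return True
    return False


segment = [Bucket([10, 20, 30]), Bucket([100, 150, 200])]
found = lookup(segment, 150)    # first bucket is skipped by the range check
missing = lookup(segment, 50)   # both buckets are skipped
```

In this toy model a lookup for a value outside every bucket's range costs only two comparisons per bucket, which is the kind of saving the Range Check results point to.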


4.4 Acceleration Methods 


With a small on-device buffer, the high computational 
latency caused by hashing could be significant and per- 
ceived by the users. We have developed three techniques 
to accelerate fingerprinting. In this section, we will show 
the effectiveness of each individual technique and then 


show the effects in aggregate. We configure an SSD with 
a 934MHz processor and a small 2MB buffer. 


4.4.1 Sampling 
Figure 14: Dedup. with Sampling

As shown in Figure 13 and Figure 14, sampling can
significantly improve performance. With the increase
of sampling unit size, fewer fingerprints need to be
calculated, which translates into a manifold reduction of
observed read and write latencies. For example, h7
achieves a speedup of 94.1 times for reads and
3.5 times for writes, because of the significantly reduced
waiting time for the buffer. Meanwhile, the deduplication
rate is only reduced from 18% to 15.4%. Considering
such a significant speedup, the minor loss of
deduplication rate is acceptable. The maximum speedup,
110.6 times (read), is observed in t1, and its
deduplication rate drops from 4.6% to 1.3%. This is
mostly because for workloads with a low duplication rate,
the probability of sampling the right pages is also
relatively low.


4.4.2 Light-weight pre-hashing 


Light-weight pre-hashing uses a fast CRC32 hash function
to filter out most unlikely-to-be-duplicated pages before
performing high-cost fingerprinting. Figure 15 shows the
speedup of reads and writes by using CRC32 for
pre-hashing, compared with CAFTL without pre-hashing.
Only pre-hashing is enabled here. We can see that in
the best case (t1), pre-hashing can reduce the latencies
by a factor of up to 148.3 for reads and 3.9 for writes.
This is because, as mentioned previously, this workload
is write intensive and has a long waiting queue, which
makes the queuing effect particularly significant. Similar
to sampling, writes receive a relatively smaller benefit,
because the buffer absorbs the writes with low latency
and diminishes the effect of speeding up writes.
Meanwhile, we also found a negligible difference in
deduplication rates, which is consistent with our
analysis shown in Figure 7.

Figure 13: Performance speedup with Sampling (unit size: 8-128KB).

[Plot: y-axis Speedup (x); x-axis workloads d1, d2, h1-h7, t1, t2. d - desktop; h - hadoop; t - transaction]
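
The filtering idea can be sketched as follows. This is a minimal model under stated assumptions, not CAFTL's implementation: the PreHashStore class, its index layout, and the choice to fingerprint first-seen pages off the write path are all hypothetical. Only zlib.crc32 and hashlib.sha1 are real library calls.

```python
import hashlib
import zlib


class PreHashStore:
    """Toy fingerprint store with a CRC32 pre-hash in front of SHA-1."""

    def __init__(self):
        self.index = {}        # CRC32 pre-hash -> set of SHA-1 fingerprints
        self.inline_sha1 = 0   # SHA-1 computations paid on the write path

    def write(self, page: bytes) -> bool:
        """Process one page; return True if it duplicates stored content."""
        crc = zlib.crc32(page)
        known = self.index.get(crc)
        if known is None:
            # Pre-hash miss: no stored page can match, so the costly
            # fingerprint is not counted on the write path (assumed here
            # to be computed later, in the background).
            self.index[crc] = {hashlib.sha1(page).digest()}
            return False
        # Pre-hash hit: only now pay for the full fingerprint.
        self.inline_sha1 += 1
        fp = hashlib.sha1(page).digest()
        if fp in known:
            return True
        known.add(fp)  # CRC32 collision between different contents
        return False


store = PreHashStore()
pages = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"C" * 4096]
results = [store.write(p) for p in pages]
```

Because a CRC32 match is always confirmed with the full fingerprint, a pre-hash collision costs extra work but never produces a false duplicate, which fits the negligible change in deduplication rate reported above.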


4.4.3 Dynamic switch 
Figure 16: Speedup with dynamic switch.
CAFTL also provides a dynamic switch to turn the
in-line deduplication on or off, depending on the
usage of the on-device buffer. We configure the high
watermark as 95% (off) and the low watermark as 50%
(on). Figure 16 shows the speedup of reads and writes
in the workloads. Again, t1 receives the most significant
performance speedup, by a factor of 200.6. Some
workloads (h1-h5) receive no benefits, because they are
less I/O intensive. For the other workloads, we can
observe a speedup of 2.1 times to 94.6 times.
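
The two watermarks form a simple hysteresis loop, sketched below. The DynamicSwitch class and the inclusive threshold comparisons are assumptions for illustration; only the 95% and 50% values come from the text.

```python
class DynamicSwitch:
    """Hysteresis switch for in-line deduplication (illustrative only)."""

    def __init__(self, high=0.95, low=0.50):
        self.high = high        # buffer usage at which dedup is turned off
        self.low = low          # buffer usage at which dedup is turned on
        self.dedup_on = True

    def update(self, buffer_usage: float) -> bool:
        """Feed the current buffer usage (0.0-1.0); return dedup state."""
        if self.dedup_on and buffer_usage >= self.high:
            self.dedup_on = False
        elif not self.dedup_on and buffer_usage <= self.low:
            self.dedup_on = True
        return self.dedup_on


switch = DynamicSwitch()
trace = [0.30, 0.96, 0.80, 0.60, 0.50, 0.70]
states = [switch.update(u) for u in trace]
# states == [True, False, False, False, True, True]
```

Between the two watermarks the switch keeps its previous state, so brief fluctuations in buffer usage do not toggle deduplication on every request.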


4.4.4 Putting it all together 
Figure 17: Three acceleration tech. combined

In Figure 17, we enable all three acceleration techniques
and show the increase of read and write latencies,
compared with the baseline case (without CAFTL), and
the corresponding deduplication rate. We can see that by
combining all three techniques, we can almost
completely remove the performance overhead with only a
2MB on-device buffer. In the meantime, we can achieve
a deduplication rate of up to 19.9%.


5 Other Related Work 


Flash memory based SSDs have received a lot of in- 
terest in both academia and industry. There is a large 
body of research work on flash memory and SSDs (e.g. 
[6, 9, 13, 15-18, 20, 23, 26-29, 31, 34, 37, 38, 42, 44]).
Concerning lifespan issues, most early work focuses on
designing garbage collection and wear-leveling policies. A
survey [21] summarizes these techniques. Here we only 
present the papers most related to this work. 

Recently Grupp et al. [22] have presented an empiri- 
cal study on the performance, power, and reliability of 
flash memories. Their results show that flash memories, 
particularly MLC devices, exhibit significant error rates 
after or even before reaching the rated lifetime, which 
makes using high density SSDs in commercial systems a 
difficult choice. Another report [13] has studied the write 
endurance of USB flash drives with a more optimistic
conclusion that the endurance of flash memory chips 
is better than expected, but whole-device endurance is 
closely related to the FTL designs. A modeling based 
study on the endurance issues has also been presented 
in [33]. These studies provide much needed information 
about the lifespan of flash memory and small-size flash 
devices. However, so far the endurance of state-of-the- 
art SSDs has not yet been proven in the field [10]. 

Early studies on SSDs mainly focus on performance. 
Some recent studies have begun to look at reliability is- 
sues. Differential RAID [9] tries to improve reliability 
of an SSD-based RAID storage by distributing parity 
unevenly across SSDs to reduce the probability of cor- 
related multi-device failure. Griffin [42] extends SSD 
lifetime by maintaining a log-structured HDD cache and 
migrating cached data periodically. A recent work [36] 
considers write cycles in addition to storage space as a 
constrained resource in depletable storage systems and 
suggests attributing depletion to users in systems like cloud
computing. ChunkStash [19] uses flash memory to speed 
up index lookups for inline storage deduplication. An- 
other work [43] proposes to integrate phase change mem- 
ory into SSDs to improve the performance, energy con- 
sumption, and also lifetime. Our study has made its 
unique contributions to enhancing the lifespan of SSDs 
by removing duplicate writes and coalescing redundant 
data at the device level, as a more general solution. 


6 Conclusion and Discussions 


Enhancing the SSD lifespan is crucial to a wide deploy- 
ment of SSDs in commercial systems. In this paper, we 
have proposed a solution, called CAFTL, and shown that 
by removing duplicate writes and coalescing redundant 
data, we can effectively enhance the lifespan of SSDs 
while retaining high data access performance. 

A potential concern about CAFTL is the volatility of 
the on-device RAM buffer — the buffered data could be 
lost upon power failure. However, this concern is not 
new to SSDs. A hard disk drive also has an on-device 
buffer, but it provides users an option (e.g. using the sdparm
tool) to enable or disable the buffer according to their needs.
Similarly, if needed, the users can choose to disable the 
in-line deduplication and the buffer in an SSD, and the 
out-of-line deduplication can still be effective. 

Although we have striven to minimize memory usage, 
CAFTL demands more space for storing fingerprints and 
the secondary mapping table, compared with traditional 
FTLs. According to our communications with an SSD
manufacturer [3], memory actually accounts for only a small
percentage of the total production cost, and the most
expensive component is flash memory. Thus we consider
this tradeoff worthwhile to extend available flash
space and SSD lifespan. If budget allows, we would


suggest maintaining the fingerprint store fully in mem- 
ory, which not only improves deduplication rate but also 
simplifies designs. 

Further improvements are also possible. One is to re- 
lax the stringent “one-time programming” requirement. 
According to the specification, each flash page in a clean 
erase block should be programmed (written) only once. 
In practice, flash chips can allow multiple programs to 
a page and the risk of “program disturb” is fairly low 
[7]. We can leverage this feature to simplify many de- 
signs. For example, we can write multiple versions of 
LBA/VBA and fingerprints into the spare area of a phys- 
ical page, which can largely remove the need for meta- 
data pages. Another consideration is to integrate a byte- 
addressable persistent memory (e.g. PCM) into the SSDs 
to maintain the metadata, which can remove much de- 
sign complexity. We are also considering the addition of 
on-line compression into SSDs to better utilize the high- 
speed processor on the device. This can further extend 
available flash space but may require more changes to 
the FTL design, which will be our future work. 

As SSD technology becomes increasingly mature and 
delivers satisfactory performance, we believe, the en- 
durance issue of SSDs, particularly high-density MLC 
SSDs, opens many new research opportunities and 
should receive more attention from researchers. 
Abstract: NAND flash-based solid-state drives (SSDs)
are increasingly being deployed in storage systems at dif- 
ferent levels such as buffer-caches and even secondary 
storage. However, the poor reliability and performance 
offered by these SSDs for write-intensive workloads
continue to be their key shortcoming. Several solutions
based on traditionally popular notions of temporal and 
spatial locality help reduce write traffic for SSDs. How- 
ever, another form of locality - value locality - has re- 
mained completely unexplored. Value locality implies 
that certain data items (i.e., “values,” not just logical ad- 
dresses) are likely to be accessed preferentially. Given 
evidence for the presence of significant value locality 
in real-world workloads, we design CA-SSD which em- 
ploys content-addressable storage (CAS) to exploit such 
locality. Our CA-SSD design employs enhancements 
primarily in the flash translation layer (FTL) with min- 
imal additional hardware, suggesting its feasibility. Us- 
ing three real-world workloads with content information, 
we devise statistical characterizations of two aspects of 
value locality - value popularity and temporal value lo- 
cality - that form the foundation of CA-SSD. We observe 
that CA-SSD is able to reduce average response times by 
about 59-84% compared to traditional SSDs. Even for 
workloads with little or no value locality, CA-SSD con- 
tinues to offer comparable performance to a traditional 
SSD. Our findings advocate adoption of CAS in SSDs, 
paving the way for a new generation of these devices. 


1 Introduction and Motivation 


NAND flash-based SSDs offer several advantages over 
magnetic hard disks: lower access latencies, lower power 
consumption, lack of noise, and higher robustness to 
vibrations and temperature. Several researchers have 
explored the performance benefits of employing these 
SSDs, either as complete replacements for magnetic 
drives or in supplementary roles (e.g., caches) [23]. 
Whereas a number of other non-volatile memory tech- 


nologies - phase-change, ferroelectric, and magnetic 
RAM - exist at different levels of maturity and offer 
similar benefits, cost/feasibility projections suggest that 
NAND flash (simply flash, henceforth) is likely to be at 
the forefront of these significant changes in storage for 
the next decade [17]. Another trend from EMC sug- 
gests that SSD prices will continue to fall to the extent 
of becoming cheaper than 15K RPM HDDs by 2017 [7]. 
Thus, exploring ways to further improve flash technol- 
ogy and its use in designing better storage systems will 
continue to be worthwhile pursuits in the coming years. 


Flash is a unique memory technology due to the sen- 
sitivity of its reliability and performance to write traf- 
fic. A flash page (the granularity of reads/writes) must 
be erased before it may be written. Erases occur at the 
granularity of blocks which contain multiple pages. Fur- 
thermore, blocks become unreliable after 5K-100K erase 
operations [38, 39, 37]. This erase-before-write property 
of flash necessitates out-of-place updates to prevent the 
relatively high latency of erases from affecting the per- 
formance of writes. These out-of-place updates create 
invalid pages that contain older versions of data requir- 
ing garbage collection. This further exacerbates the re- 
liability/performance concerns by introducing additional 
write operations. Techniques that reduce the number of 
writes to SSDs are, therefore, desirable and have received 
a lot of attention. Existing approaches for write reduction
have relied on exploiting the presence of (i) temporal lo- 
cality (e.g., buffering writes within file system/SSD/other 
media to eliminate duplicate writes to flash [24, 46, 45]), 
and/or (ii) spatial locality (e.g., coalescing multiple sub- 
page writes into fewer page writes [30]) within work- 
loads. However, there is yet another dimension of lo- 
cality - value locality - that has remained unexplored for 
flash SSDs. The presence of value locality in a work- 
load means that it preferentially accesses certain content 
(i.e., values) over others. This property facilitates data 
de-duplication (storing only one copy of each unique 
value), which is especially attractive for SSDs as it
naturally offers the write reduction that these devices can
benefit from: an SSD employing such data de-duplication
need not do an additional write of a value that it has al- 
ready stored. This benefit applies even if the two writes 
belong to entirely different logical addresses and even in 
the absence of any temporal/spatial correlation between 
these two writes. Data de-duplication can also reduce 
read traffic, with additional performance benefits. 

Content addressable storage (CAS) is a popular de- 
duplication technique which operates on data by dividing 
it into non-intersecting chunks, and employing a crypto- 
graphic hash to represent each chunk. By storing only 
unique hashes (and their corresponding data chunks), du- 
plicate chunks in data are removed. Hashing can result 
in collisions where different data blocks can be mapped 
to the same value. However, it has been shown that 
such collisions are practically unlikely, with probabili- 
ties in the range 10~° to 107!” [40, 42] for MD5 and
SHA-1. Additionally, techniques to further reduce this 
probability to as low as 10~4° have been shown to be 
feasible [40, 43]. Thus, consistent with most CAS re- 
search [12, 42, 35], we also assume hash functions to 
be collision-resistant. CAS has been extensively used in 
archival and backup systems [42, 43, 14], but its bene- 
fits specific to SSDs have not been explored. Whereas 
SSDs could benefit from existing host-level (e.g., file 
system [47]) implementations of CAS, thereby reducing 
I/O traffic, there is significant motivation to realize this
functionality within the device itself. It allows incorpora- 
tion of value locality without requiring any modifications 
to the upper layers (filesystem, block layer etc.), thus al- 
lowing quick adoption in existing systems. Several SSD 
optimizations that rely upon information about flash data 
layout are better implemented within the SSD. For ex- 
ample, garbage collection efficiency can be improved by 
using data placement policies which reduce overheads of 
copying valid pages. Also, scalability of a CAS-based 
scheme crucially depends on its ability to carry out fast 
calculations/look-ups of hashes. This can be achieved by 
using dedicated hardware such as that increasingly avail- 
able in SSDs (e.g., those with Full Disk Encryption ca- 
pabilities [5, 44, 41]), relieving the host of these compu- 
tational overheads. 
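
The chunk-and-hash scheme described at the start of this paragraph can be sketched as a small in-memory store. The ContentStore class, the 4KB chunk size, and the use of SHA-1 via hashlib are illustrative assumptions; this is not the CA-SSD design itself, which is presented in later sections.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size for this sketch


def split_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield non-intersecting, fixed-size chunks of data."""
    for off in range(0, len(data), size):
        yield data[off:off + size]


class ContentStore:
    """Toy content-addressable store: one copy per unique chunk."""

    def __init__(self):
        self.chunks = {}  # SHA-1 digest -> chunk bytes

    def put(self, data: bytes):
        """Store data; return the list of digests needed to rebuild it."""
        recipe = []
        for chunk in split_chunks(data):
            digest = hashlib.sha1(chunk).digest()
            self.chunks.setdefault(digest, chunk)  # duplicates are not stored
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild the original data from its digests."""
        return b"".join(self.chunks[d] for d in recipe)


store = ContentStore()
data = b"x" * 8192 + b"y" * 4096
recipe = store.put(data)
# Three chunks are written, but only two unique chunks are kept.
```

The sketch treats the hash as collision-resistant, matching the assumption stated above: two chunks with the same digest are taken to be identical without comparing their bytes.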


Key Choices and Challenges: A number of interest- 
ing design choices and challenges arise when designing 
a SSD that employs CAS for its internal data manage- 
ment. First, in order to maintain compatibility with ex- 
isting storage software, we choose that our SSD continue 
to expose its existing block interface. Modifications to 
the SSD interface such as nameless writes [11] can po- 
tentially benefit CA-SSD but require changes to the up- 
per layers. Second, employing CAS necessitates sev- 
eral enhancements to the data structures maintained by 
our SSD’s flash translation layer (FTL). This increased
“meta-data” puts additional pressure on the scarce on- 
SSD RAM and must be managed carefully. Third, data 
de-duplication renders ineffective existing mechanisms 
employed by the FTL to recover its meta-data after power 
failures. Existing FTLs store information about the logi- 
cal address (LPN) stored on a flash page in a special re- 
gion called the out-of-band area (OOB) within the page 
itself. Due to de-duplication with CAS, a given page 
may correspond to multiple LPNs (different LPNs may 
contain the same content), and thus, its OOB area can-
not be used as before. Fourth, with CAS the notion of 
when a page becomes invalid changes - a page should 
now be invalidated only when all the LPNs having that 
content have written a “different content” - implying a 
re-consideration of the design of the garbage collector. 
Finally, whereas we design our SSD to exploit value
locality whenever present, we would like it not to exhibit
worse performance or reliability than a state-of-the-art
SSD in the absence of such locality.


Research Contributions: We make the following
contributions in this paper.


• We propose CA-SSD, a flash solid-state drive that
employs CAS for internal data management and
addresses all the concerns outlined above. We
demonstrate how CA-SSD functionality can be achieved
mostly by modifying the FTL and with minimal
support in the form of additional hardware
compared to traditional SSDs. This additional hardware
is similar to that already present in many
state-of-the-art SSDs.


• We identify and characterize salient aspects of value
locality, value popularity and temporal value locality,
and design CA-SSD algorithms to exploit them.


• Using three real-world workloads with content
information, we evaluate the efficacy of CA-SSD by
simulations. We observe that CA-SSD is able to
reduce the average response times by about 59-84%
for these workloads. Additionally, from 10 real-world
traces, we synthesize workloads with different
degrees of value locality. We find that CA-SSD
consistently outperforms traditional SSD with even
small degrees of locality and offers comparable
performance when there is little or no value locality.


The rest of this paper is organized as follows. In Sec- 
tion 2 we provide an overview of the design of our CA- 
SSD comparing it to traditional SSDs. We discuss key 
aspects of value locality that affect CA-SSD design in 
Section 3. We design CA-SSD using insights gained in 
Section 4 and evaluate it in Section 5. Finally, we present 
related work in Section 6 and conclude in Section 7. 
2 Overview of Our CA-SSD
























































       | Data Unit                          | Access Time                | Lifetime
Type   | Page Data | Page OOB | Block       | Read   | Write  | Erase    | Write/Erase
       | (Bytes)   | (Bytes)  | (Bytes)     | (us)   | (us)   | (ms)     | (cycles)
SLC1   | 2048      | 64       | 128K+4K     | 25     | 200    | 1.5      | 100K
SLC2   | 4096      | 128      | 256K+8K     | 25     | 500    | 1.5      | 100K
MLC    | 4096      | 224      | 512K+28K    | 60     | 800    | 2.5      | 5K





Table 1: SLC & MLC NAND Flash characteristics [38, 39, 
37]. SLC1/SLC2 represent SLC SSDs with different page 
sizes. Read/write latencies are at the granularity of pages while 
erase latencies are for blocks. 


In this section, we describe how a flash-based SSD 
works and provide an overview of the changes to imple- 
ment our CA-SSD. 


2.1 Flash Solid-State Drives: A Primer 


Figure 1(a) presents the key components of a traditional 
NAND flash-based SSD. In addition to the read and write 
operations which are performed at the granularity of a 
page, flash also provides an erase operation which is 
performed at the granularity of a block (composed of 
64-128 pages). The coarser spatial granularity of erases 
makes them significantly slower than reads/writes. Fur- 
thermore, there is an asymmetry in read and write la- 
tencies, with writes being slower than reads. Blocks are 
further arranged in planes which can allow simultane- 
ous operations through multi-plane commands thus im- 
proving performance [10]. In this paper, we only con- 
sider a single plane and our ideas and results apply read- 
ily to multiple planes. A page must first be erased be- 
fore it can be written. The erase-before-write property of 
flash memory necessitates out-of-place updates to pre- 
vent the relatively high latency of erases from affecting 
the performance of updates. These out-of-place updates 
result in invalidation of older versions of pages requir- 
ing Garbage Collection (GC) to reclaim certain invalid 
pages in order to create room for newer writes. At a high 
level, GC operates by erasing certain blocks after relo- 
cating any valid pages within them to new pages. A fi- 
nal characteristic concerns the lifetime of flash memory, 
which is limited by the number of erase operations on its 
cells. Each block typically has a lifetime of 5K(MLC) or 
100K(SLC) erase operations. Wear leveling (WL) tech- 
niques [20, 22, 32] are employed by the FTL to maintain 
similar lifetime for all the blocks. Table 1 presents
representative values for the operational latencies, page/block
sizes, and lifetime for two main flash technologies (SLC 
and MLC) [38, 39, 37]. We consider SLC-based flash in 
this work, although our ideas apply equally to MLC. 
The Flash Translation Layer (FTL) is a software layer 
that helps in emulating an SSD as a block device by hid- 


ing the erase-before-write characteristics of flash mem- 
ory. The FTL consists of three main logical compo- 
nents: (i) a Mapping Unit that performs data placement 
and translation of logical-physical addresses, (ii) the GC, 
and (iii) the WL. A key data structure maintained by the 
FTL is a Mapping Table which stores address transla- 
tions. Upon receiving a write/update request for a logical 
page the FTL: (i) chooses an erased physical page where 
it writes this data, (ii) invalidates the previous version (if 
any) of the page in question, and (iii) updates its map- 
ping table to reflect this change. The Mapping Table is 
typically stored on SSD’s RAM to allow fast translation.¹
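
The three write steps listed above can be sketched with a toy page-mapped FTL. The SimpleFTL class and its fields are hypothetical names chosen for illustration; garbage collection and wear leveling are left out, so the sketch assumes erased pages never run out.

```python
class SimpleFTL:
    """Toy page-mapped FTL showing out-of-place updates (no GC or WL)."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))  # erased physical pages
        self.mapping = {}                   # Mapping Table: LPN -> PPN
        self.invalid = set()                # pages holding stale versions
        self.flash = {}                     # PPN -> stored data

    def write(self, lpn: int, data: bytes) -> int:
        ppn = self.free.pop(0)              # (i) choose an erased page
        self.flash[ppn] = data
        old = self.mapping.get(lpn)
        if old is not None:
            self.invalid.add(old)           # (ii) invalidate old version
        self.mapping[lpn] = ppn             # (iii) update the mapping
        return ppn

    def read(self, lpn: int) -> bytes:
        return self.flash[self.mapping[lpn]]


ftl = SimpleFTL(num_pages=4)
ftl.write(7, b"v1")
ftl.write(7, b"v2")   # update lands on a new page; the old one turns invalid
latest = ftl.read(7)
```

After the second write, physical page 0 still holds the stale data and sits in the invalid set until a garbage collector erases its block.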


2.2 SSD Enhancements for CAS


In Figure 1(b), we present the additional compo- 
nents/functionality (compared to a traditional drive) re- 
quired by CA-SSD. For both devices, we also show the 
steps involved in processing requests coming from the 
block device driver to help understand the difference in 
their operation. We refer to the FTL in CA-SSD as CA- 
FTL. Read requests are handled identically in both the 
SSDs and so we only focus on write requests. Whereas a 
traditional SSD requires all writes to be sent to physical 
pages, CA-SSD returns a write request without requiring 
flash page writes if hashes, representing their content, are 
found in RAM. We require four key enhancements to a 
traditional SSD to achieve this functionality. 
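
A minimal sketch of the write behaviour just described is given below, assuming a hypothetical CAWritePath class with two in-RAM dictionaries and SHA-1 as the content hash. It only counts flash page writes and leaves out the meta-data management, persistence, and GC that CA-SSD needs in practice.

```python
import hashlib


class CAWritePath:
    """Toy model of a content-addressed write path (illustrative only)."""

    def __init__(self):
        self.hash_to_ppn = {}   # content hash -> physical page holding it
        self.lpn_to_hash = {}   # logical page -> hash of its content
        self.flash_writes = 0
        self.next_ppn = 0       # stand-in for a free-page allocator

    def write(self, lpn: int, data: bytes) -> bool:
        """Handle one write; return True if no flash write was needed."""
        h = hashlib.sha1(data).digest()
        hit = h in self.hash_to_ppn
        if not hit:
            # Miss: the content is new, so a flash page must be written.
            self.hash_to_ppn[h] = self.next_ppn
            self.next_ppn += 1
            self.flash_writes += 1
        # Hit or miss, the logical page now refers to this content.
        self.lpn_to_hash[lpn] = h
        return hit


ssd = CAWritePath()
hits = [ssd.write(1, b"A" * 4096),
        ssd.write(2, b"A" * 4096),   # same content, different LPN
        ssd.write(3, b"B" * 4096)]
```

Three logical writes result in two flash page writes here, because the second write finds its hash in RAM and only the mapping is updated.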

(i) Hashing Unit: CA-FTL requires the ability to com-
pute/compare content hashes such that these operations 
only degrade the CA-SSD performance to a negligible 
(or tolerable) extent. To ensure this, we propose to em- 
ploy a dedicated co-processor to implement our hash- 
ing unit. Recently, manufacturers like OCZ [5], Sam- 
sung [44] and pureSilicon [41] have developed high per- 
formance SSDs with on-board cryptographic processors, 
suggesting that the desired fast hashing is feasible. 

(ii) Additional Meta-data: The Mapping Unit must
maintain additional data structures for CAS, which puts
additional pressure on the on-SSD RAM. These structures
represent CA-FTL’s meta-data (to be distinguished from
the meta-data for software such as the file system) and
the portion of on-SSD RAM used for storing it is
referred to as the meta-data cache. We describe these data
structures and space-efficient ways of managing them in
Section 4.1.

¹ An SSD typically has a small SRAM and a larger DRAM cache
whose size is in the range 64-512 MB for an SSD with capacity 256-
1024 GB [1, 6]. We ignore this distinction in our discussion.

Figure 1: Components of a CA-SSD compared to traditional SSD. CA-SSD has two new hardware elements: (i) a
hashing co-processor and (ii) a battery-backed RAM (BB-RAM). Furthermore, CA-SSD stores hashes instead of LPN
in the page OOB area. Also shown is a comparison of how writes are handled in the two devices. (a) Traditional SSD:
(1-2) On receiving a write request from device driver, SSD controller issues a flash page write. (3-4) On completion,
the Mapping Table in the volatile RAM is updated and driver is notified of request completion. (b) CA-SSD: (1-2) On
receiving a write request, the SSD controller sends the content to the hash co-processor for hash computation. (3-4)
The returned hash is then looked up in the Mapping Table in the BB-RAM. (5-6(a)) On a hit, the mapping structures
are updated and the request completes. (5-9(b)) On a miss, a flash page write is performed, mapping structures are
updated and the request is completed.

(iii) Persistent Meta-data Store: Our CA-SSD de-
sign necessitates a re-consideration of the mechanism
for recovering the contents of the meta-data cache after
a power failure. When writing a physical page (PPN),
a traditional FTL also stores the logical page number
(LPN) in a special-purpose part of the PPN called the
out-of-band (OOB) area, which is typically 64-224 B in
size. After a power failure, these entries in the OOB 
are used to reconstruct the LPN-to-PPN mappings. In 
CA-FTL, multiple LPNs may contain the same value and 
hence correspond to the same PPN. The OOB area may 
not have enough room for all these LPNs. Furthermore, a 
value can be associated with a changing set of LPNs over 
its lifetime, requiring multiple writes to the same OOB 
area, with corresponding erase/copying operations. We 
address this difficulty by requiring that CA-FTL’s Map- 
ping Table be kept in a fast persistent storage in the first 
place, without any need to store a copy on flash. Storing
a copy on flash would result in a large number of
meta-data writes on flash, increasing the number of flash
page writes. An alternative approach could be to perform
periodic check-pointing of the Mapping Table instead of
immediate writes on flash to reduce the number of
meta-data writes, but this provides weaker guarantees on
meta-data consistency. In order to provide consistency
guarantees similar to existing SSDs without impacting the
overall performance, we employ persistent battery-backed
RAM. We indicate this as BB-RAM in Figure 1(b).
Write caches based on such battery-backed DRAM are 
commonly used in RAID controllers [3]. Even SSD man- 
ufacturers have started providing battery-backed DRAM 
as a standard feature to deal with power failures [5, 4]. 
Such SSDs with both battery-backed DRAM as well as
on-board cryptographic processors have similar
performance and costs as compared to traditional SSDs [5],
mitigating performance and cost concerns for CA-SSD.
Recent work has considered employing other persistent
media (e.g., PCM [45] and even hard disk [46]) for SSD
write optimizations, and exploring such alternatives for
the CA-SSD meta-data cache is part of future work.

(iv) Re-design of GC: CAS results in a change to GC. 
In conventional FTLs, each update results in the invalida- 
tion of a page requiring an eventual erase operation. But 
CA-FTL only needs to invalidate a page when no LPN 
points to the value in that page. This redefines the way 
garbage is created and distributed in blocks impacting the 
efficiency of GC. We study these issues in Section 4.2. 
We do not modify WL policy in this work and assume 
CA-SSD continues to employ the default WL. 
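
One simple way to realize the invalidation rule above is a reference count per physical page, sketched below. The PageRefs class is a hypothetical illustration; the actual CA-FTL structures are described in Section 4, and this sketch assumes callers only point LPNs at pages that are still valid.

```python
class PageRefs:
    """Toy reference counting: a page is invalid once no LPN points to it."""

    def __init__(self):
        self.lpn_to_ppn = {}   # logical page -> physical page
        self.refcount = {}     # physical page -> number of LPNs using it
        self.invalid = set()   # pages the garbage collector may reclaim

    def point(self, lpn: int, ppn: int) -> None:
        """Make lpn refer to the content stored in ppn."""
        old = self.lpn_to_ppn.get(lpn)
        if old == ppn:
            return  # same LPN rewriting the same content: nothing changes
        if old is not None:
            self.refcount[old] -= 1
            if self.refcount[old] == 0:
                # Last reference is gone: only now is the page garbage.
                del self.refcount[old]
                self.invalid.add(old)
        self.lpn_to_ppn[lpn] = ppn
        self.refcount[ppn] = self.refcount.get(ppn, 0) + 1


refs = PageRefs()
refs.point(1, 0)
refs.point(2, 0)     # LPN 1 and LPN 2 share the content in page 0
refs.point(1, 5)     # LPN 1 is overwritten; page 0 is still referenced
still_valid = 0 not in refs.invalid
refs.point(2, 6)     # last reference to page 0 is dropped
```

In a conventional FTL the overwrite of LPN 1 would already have invalidated its old page; here page 0 only becomes garbage after LPN 2 is overwritten as well.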


3 Value Locality Characterization 


We describe two aspects of value locality (VL) that have 
performance/lifetime implications for a CA-SSD. We 
propose ways to express these aspects statistically and 
discuss their implications for possible improvements in 
CA-SSD. Throughout our discussion, we employ three 
workload traces [26] described in Table 2 to present ex- 
amples of our VL characterization. homes represents a 
file server of the home directories of a research group 
in FIU’s CIS department. A major source of content
similarity in this workload can be attributed to work
done by different members of the group on copies of
the same software code, technical documents, etc. present
in their directories. mail has been collected from the
e-mail server of the same department containing similar
mailing-list emails and circulated attachments, resulting
in content similarity across user INBOXes. Finally,
web is their Web server workload consisting of virtual
machines hosting an online course management system
and email access portal. These workloads are primarily
write-dominant, especially homes, which has about
97% write requests. Individual requests in these
workloads are of size 4KB, along with a 16B hash (MD5) of
the contents.

Figure 2: Value popularity in real-world workloads (1 day traces). The x-axis consists of unique values sorted accord-
ing to their read or write popularity. That is, a given point on the x-axis might correspond to different values for reads
and writes. We also show the number of unique values that correspond to 50% of all write requests.

Workload | Size (GB) | % Writes | Req. (mill.) | Unique Write Req. (%) | Unique Read Req. (%) | Seq. %
web      | 1.95      | 77.01    | 3.8          | 42.35                 | 32.05                | 83.8
mail     | 4.22      | 77.32    | 3.6          | 7.83                  | 80.85                | 94.7
homes    | 3.02      | 96.76    | 4.4          | 66.37                 | 80.75                | 70.8

Table 2: Workload statistics. Workload duration varies
from 1 day (mail) to 7 days (web, homes). Size represents
the total number of unique LPNs accessed in the
trace over the mentioned duration and hence represents a
compacted trace without any intermediate non-accessed
LPNs (the SSD size chosen for evaluation is 4GB for
homes & web and 6GB for mail). The logical address
space exposed to the file-system is much larger [26].
Unique Request denotes the fraction of write (read)
requests which write (read) unique 4KB chunks. Requests
are deemed sequential (seq.) if they access consecutive
LPNs.


Value Popularity (VP): The most straightforward 
characterization of VL represents the popularity (num- 
ber of occurrences) of each unique value, for both reads 
and writes separately. The VL for writes and reads have 
different implications for CA-SSD: whereas the former 
captures reduction in write traffic offered by caching 
the corresponding (value, LPN, physical page) informa- 
tion in the meta-data cache, the latter captures reduc- 
tion in reads due to caching the corresponding content 
in the content cache. Table 2 shows the high VP ex- 
hibited by real-world workloads. For instance, mail has 
only 7.83% unique write requests, representing a huge 


potential for de-duplicating the remaining 2.63 million 
writes. Similarly, web and homes can provide 57.65% 
and 33.63% write reductions respectively, improving the 
performance and lifetime of SSDs substantially. Furthermore,
only a small fraction of writes in these workloads
are due to the same values being written at the same
locations. For example, about 8% of overall writes in mail
and homes are due to the same LPN writing the same
content successively. A majority of duplicate writes are
attributed to the same content being written to different
locations, requiring a sophisticated CAS-based scheme for
de-duplication. In Figure 2, we present VP (as CDFs)
for reads and writes for the three workloads. A given 
point on the x-axis can correspond to different values for 
reads/writes. 
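
The value-popularity statistics discussed above can be reproduced from any trace that carries a content hash per request. The following is a minimal sketch, not the authors' tooling: the function name, the toy trace, and the reading of "unique request fraction" as distinct values divided by total write requests are all assumptions made for illustration.

```python
from collections import Counter


def value_popularity(write_hashes, coverage=0.5):
    """Return (unique_fraction, popular_fraction) for a list of write hashes.

    unique_fraction:  distinct values / total write requests (assumed definition).
    popular_fraction: share of the distinct values, taken in decreasing order
                      of popularity, needed to cover `coverage` of all writes.
    """
    counts = Counter(write_hashes)
    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0
    covered = 0
    needed = 0
    # Walk values from most to least popular until enough writes are covered.
    for _, count in counts.most_common():
        covered += count
        needed += 1
        if covered >= coverage * total:
            break
    return len(counts) / total, needed / len(counts)


# Toy trace (hypothetical): value "a" is written four times out of eight.
trace = ["a", "b", "a", "a", "c", "a", "b", "d"]
unique_fraction, popular_fraction = value_popularity(trace)
```

In this toy trace half of the writes carry a previously unseen value, and a single value (one quarter of the distinct values) already accounts for 50% of the writes, which is the kind of skew the dotted lines in Figure 2 mark.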

The following insights and observations emerge from 
our definition and these statistics. First, all these workloads
exhibit significant skewness in VP, i.e., a small fraction
of total values accounts for a large number of accesses.
For example, the fractions of total unique values that account
for 50% of the overall writes are 14.44%, 8.84%,
and 29.99% for homes, mail and web, respectively (shown
by dotted lines). Therefore, pinning these values in
the meta-data cache can offer write traffic reduction of 
35.56%, 41.16%, and 20.01%, respectively. Similar 
benefits apply for reads upon caching the most popu- 
lar (value, content) pairs in the content cache. Second, 
we find that these workloads exhibit different degrees 
of value popularity (e.g., homes has higher VP for reads 
than mail, while mail has higher VP for writes) implying 
different degrees of potential benefits for reads/writes. 


Temporal Value Locality (TVL): The presence of 
TVL in a workload implies that if a certain value (as 
opposed to LPN) is accessed now, it is likely to be ac- 
cessed again in the near future, not necessarily by the 
same LPN. We distinguish TVL for writes and reads to 
be able to differentiate benefits that could be obtained
from the use of meta-data vs. content caches. We modify
a standard way of characterizing LPN-based temporal
locality for representing TVL [21]. For each workload,
assuming the meta-data cache to be managed as a queue
with a least-recently-used (LRU) eviction policy for values,
we present CDFs of the number of writes of the value
at the (i + 1)th (i ≥ 0) location within the LRU queue in
Figure 3.

Figure 3: Temporal value locality and temporal locality (labeled LPN) for writes in real-world workloads (1 day traces).
We show the meta-data cache size that can contribute to 90% of the total writes.

Figure 4: Cache miss rate for popular values. The numbers
in brackets represent the length of the trace in number
of days. Note that popular values denotes the minimum
number of values which account for 50% of accesses in
the workload. The cache size on the X-axis (logscale) is
in terms of 1K hashes.


Implications for writes: The presence of TVL for writes 
implies that even a small meta-data cache could achieve 
high hit rates to provide write reduction. For example, 
the maximum size of the meta-data cache required for 
storing all the values in the 1 day trace of homes is around 
7.5MB (each entry in this cache requires 28B for stor- 
ing the hashing structures as we explain in Section 4.1). 
However, 90% of writes for homes are satisfied within 
11046 positions in the LRU queue requiring only about 
600KB in the meta-data cache, thus reducing the space 
requirements by about 96%. Even mail, which shows
lower TVL, provides savings of approximately 65% while
achieving a 90% hit rate.
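
The LRU-queue characterization of TVL used above can be sketched as follows. This is an illustrative, quadratic-time version rather than the authors' analysis code; the function names are hypothetical, and the choice to measure the queue depth over repeated values only (ignoring first occurrences) is an assumption.

```python
import math


def lru_positions(write_hashes):
    """Position (0 = head) of each written value in an LRU queue of values.

    First occurrences of a value are recorded as None.
    """
    queue = []       # most recently written value sits at index 0
    positions = []
    for value in write_hashes:
        if value in queue:
            pos = queue.index(value)
            queue.pop(pos)
            positions.append(pos)
        else:
            positions.append(None)
        queue.insert(0, value)
    return positions


def queue_depth_for(positions, fraction=0.9):
    """Smallest number of LRU entries that captures `fraction` of the repeats."""
    hits = sorted(p for p in positions if p is not None)
    if not hits:
        return 0
    index = math.ceil(fraction * len(hits)) - 1
    return hits[index] + 1


trace = ["a", "b", "a", "c", "b", "a", "a"]
positions = lru_positions(trace)
depth_90 = queue_depth_for(positions, 0.9)
depth_50 = queue_depth_for(positions, 0.5)
```

Multiplying the resulting depth by the per-entry size (28B in the text) gives the meta-data cache footprint needed for that hit fraction.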


Clearly the size of meta-data cache affects these gains. 
Figure 4 shows the miss rate for popular value lookups 
done for writes as a function of different sizes of this 
cache. Additionally, we use portions of the workloads 
over 1-day and 2-day periods and find that TVL is sustained
over this duration. We find that for our workloads, an
LRU cache based on TVL is able to hold popular values,
thus offering an easy way to implement a technique
that can recognize VP. Whereas in our workloads, TVL 
and skewness in VP occur together, generally speaking, 
these could be mutually exclusive. For example, it may 
be the case that for a workload with high TVL, all val- 
ues are equally popular, i.e., have comparable number 
of write accesses, thus displaying low skewness in VP. 
Alternatively, a workload with high skewness (in VP) can
exhibit low TVL if the popular values have a long time 
gap between successive accesses. We design CA-SSD 
so that it can exploit these properties whenever present, 
but not experience degraded performance (compared to a 
regular SSD) when these are absent. 


Implications for reads: We observe higher TVL than 
temporal locality even for reads suggesting that, for these 
workloads, a value-based content cache is likely to outperform
one using LPNs and offer a reduction in read traffic
to the SSD. A similar observation was made in [26] for
developing a content-based cache for improving I/O performance
in the context of HDD-based storage.


Finally, one could also consider a notion of spatial 
value locality (SVL). The principle of spatial locality, as
used conventionally, can be stated as follows: if (con- 
tent corresponding to) a logical address X is accessed 
now, addresses in the neighborhood of X are likely to 
be accessed in the near future. SVL emerges from a 
generalized take on what the neighborhood or proxim- 
ity of a data item means. It posits that a given value, 
even when part of different logical data items, is likely 


USENIX Association 


to see similarities among the values in its neighborhood. 
Stated another way, spatial value locality hypothesizes 
that there might exist positive correlations among cer- 
tain values in terms of their closeness with respect to 
their addresses within (possibly multiple) logical data 
objects. SVL has been used for handling disk bottleneck 
for meta-data management in CAS systems for backup 
applications by prefetching key-value pairs which are ac- 
cessed together [48]. For SSDs, it can provide additional 
benefits for reads when sub-page level chunks are used. 
We do not explore SVL or other optimizations for reads 
in this work and leave it as part of our future work. 


4 Design of CA-FTL 


We develop the CA-FTL mapping unit and GC based on 
the issues discussed in Section 2. We assume a CAS 
chunk unit to be equal to the flash page size. 


4.1 The CA-FTL Mapping Unit


Address Translation and Meta-data Management: 
As discussed in Section 2, CA-FTL requires additional 
data structures for maintaining information about hashes 
and their relationship with LPNs. Figure 5(b) shows the 
data structures we employ to realize CA-FTL’s Mapping 
Unit. We assume address translations to be kept at the 
granularity of a page. Such page-level mappings have 
been shown to be desirable and scalable in recent re- 
search [16, 25]. First, similar to existing FTLs, we have a 
table called LPT which stores translations from LPNs
to PPNs. Each entry requires 4B for storing the LPN and
another 4B for PPN. Thus, the maximum space needed 
for LPT in a 4GB SSD is 8MB (for 100% flash utiliza- 
tion). Second, an inverted LPT (iLPT) stores the list 
of LPNs that correspond to the same value and thus the 
same PPN. The iLPT is used to keep track of valid val- 
ues. If the LPN list for a PPN is empty, it signifies that 
no LPN stores the value present in that PPN and the page 
should be invalidated. The iLPT is queried during GC 
and WL for updating the LPT whenever the PPN stor- 
ing a value changes. Third, we use a hash-to-PPN table 
(HPT) to store hash to PPN mappings that is looked up 
on a write request to decide whether the write is for an 
existing value (no flash write needed) or a new value (re- 
quires a flash write). Entries are inserted or updated in 
the HPT upon (i) a write request with a new value or (ii) 
a page write due to GC/WL, respectively. Page invalidations
result in removal of entries. Each hash is 16-20B
long depending on the hashing algorithm used (16B for
MD5 and 20B for SHA1) whereas a PPN is 4B long. For
a 4GB SSD, the maximum space needed for the HPT
is 20-24MB since the maximum number of PPNs it can
store is 1M. All further discussion is in the context of the MD5
hashes present in the available real-world workloads, but
our ideas apply readily to SHA1 hashes also. Fourth,
we employ an inverted HPT (iHPT) which maps PPNs
to hashes by storing the addresses of the corresponding
HPT entries. It stores the same number of valid entries as
the HPT. When a flash page is invalidated, the iHPT provides the
address of the corresponding HPT entry to be removed
without incurring an OOB read.
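
The worst-case table sizes quoted above follow from simple arithmetic, sketched below. The 4KB page size is an assumption inferred from the 4KB requests and the 1M-page figure for a 4GB SSD; the variable names are illustrative only.

```python
PAGE_SIZE = 4 * 1024        # assumed 4KB flash page (4GB / 1M pages)
SSD_SIZE = 4 * 1024 ** 3    # 4GB SSD, as in the text
PPN_BYTES = 4
LPN_BYTES = 4

pages = SSD_SIZE // PAGE_SIZE   # number of flash pages at 100% utilization


def to_mb(num_bytes):
    """Convert bytes to MB (2**20 bytes)."""
    return num_bytes / 1024 ** 2


lpt_mb = to_mb(pages * (LPN_BYTES + PPN_BYTES))   # LPN + PPN per entry
hpt_md5_mb = to_mb(pages * (16 + PPN_BYTES))      # 16B MD5 hash + PPN
hpt_sha1_mb = to_mb(pages * (20 + PPN_BYTES))     # 20B SHA1 hash + PPN
```

Under these assumptions the LPT needs 8MB and the HPT between 20MB and 24MB, matching the figures in the text.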


Let us now understand how to deal with the space overheads
of these data structures. (i) Gupta et al. [16]
have proposed a page-based FTL which exploits temporal
locality in workloads to reduce the LPT space requirements.
As shown in Figure 3, real-world workloads
also demonstrate significant temporal locality apart
from TVL. Thus, we can utilize variants of page-based
FTLs to reduce the space requirements by only storing
a subset of the LPT/iLPT in our BB-RAM. (ii) Since
the HPT/iHPT’s space needs can be prohibitively large
(recall that on a 4GB flash, they require up to 23MB of
RAM), we are forced to only store a subset. Given our
findings about the presence of TVL in workloads, we implement
the HPT as a cache of hash-to-PPN mappings
employing an LRU eviction policy for writes of values.
The size of this cache could be chosen by CA-FTL based 
on how much RAM it can afford to use for meta-data 
storage. When all of this cache is occupied, to insert a 
new entry we discard the least-recently-used entry from 
the HPT and the iHPT. A salient aspect of our strategy 
is that, unlike a traditional LRU-based queue, we do not 
maintain the remainder of the HPT/iHPT (which does not 
fit in RAM) on another storage medium (e.g., the flash 
medium itself). On an entry’s eviction from the meta- 
data cache, we simply discard it. This saves us potential 
flash page writes (write-back of evicted dirty entries) and 
reads (mapping entry lookup on a HPT/iHPT miss). This 
scheme trades off reduction in RAM occupied for meta- 
data for a reduction in the degree of data de-duplication 
achieved, since some values may be re-written upon HPT 
misses. Our findings on TVL in real workloads in Sec- 
tion 3 suggest that such misses are likely to be rare even 
for nominal cache sizes. This is shown in Figure 4 where 
a cache size of 1.75MB (for storing 64K hashes) yields 
miss rates less than 7% for mail. For web and homes, 
these miss rates are even smaller, being 0.4% and 4%, 
respectively. Thus, most of the discarded entries corre- 
spond to less popular values which have low write fre- 
quency and less impact on de-duplication efficiency. In 
Section 5, we evaluate the performance of CA-SSD with 
different meta-data cache sizes. Data/meta-data con- 
sistency is not impacted due to this scheme since the 
LPT which stores the LPN-to-PPN mappings required 
for managing consistency (explained earlier in Section 
2.2) is managed independent of this strategy. Furthermore,
BB-RAM is only needed for persistent storage of the
LPT, whereas other mapping structures can be stored on
volatile RAM without impacting consistency.

Figure 5: (a) Flowchart depicting how writes are handled by CA-FTL. val represents the content to be written. (b)
Example of write requests: (1) Write request (L3,V3) to a new LPN L3 with a new value V3 results in a flash page
write (P3). Entries are added to all four data structures. (2) Update (L2,V1) results in an HPT hit for H1; the entry
is moved to the head of the LRU queue (based on TVL) in the HPT. L2 is then added to the LPN list for P1 in the iLPT
and removed from P2’s list. Since P2’s list (in the iLPT) is now empty, the flash page (P2) is invalidated and the
corresponding entries in HPT and iHPT are removed. (Note that iHPT only stores the address of the corresponding
HPT entry and not the complete hash.)


Handling Read/Write Requests: Read requests in 
CA-FTL are handled similarly to traditional FTLs. The LPT
is looked up to locate the PPN storing the value and its
contents are returned to the upper layers. The flowchart
in Figure 5(a) describes the handling of write/update requests
in CA-FTL. On receiving a write request, the hash
of the value for each LPN comprising the request is cal- 
culated and the HPT is then looked up with this hash. A 
miss is deemed to indicate request for a new value and 
a flash page write is issued. If the HPT is fully occu- 
pied, the least recently used entry is discarded and the 
new (hash, PPN) entry is inserted at the head of the LRU 
queue based on TVL. Corresponding updates are made 
in the iHPT also. Finally, the LPT and the iLPT are up- 
dated. On a hit in the HPT, the entry is moved to the head 
of the LRU queue and LPT/iLPT are updated. Furthermore,
update requests may result in an LPN storing a different
value, requiring modifications to the mapping entries
for the LPN’s earlier value. If the LPN list in iLPT for the 
PPN corresponding to the LPN’s earlier value is empty, 
the entry and the physical page on flash are invalidated. 
Finally, the HPT/iHPT entries for this PPN are also re- 
moved (Note that the eviction strategy may have already 


FAST ’11: 9th USENIX Conference on File and Storage Technologies 


discarded the HPT/iHPT entries, hence not requiring an
explicit removal). Figure 5(b) gives examples describ- 
ing the handling of writes in CA-FTL including relevant 
meta-data cache management. 
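
The write path described above can be sketched as a small in-memory model. This is a simplified illustration under stated assumptions, not the authors' implementation: the class name `CAFTLSketch`, the append-only page allocator, and the `invalid` set are hypothetical, the iHPT stores the hash itself rather than an HPT entry address, and GC, WL and real flash I/O are omitted.

```python
import hashlib
from collections import OrderedDict


class CAFTLSketch:
    """Toy model of the CA-FTL write path (no real flash, GC, or WL)."""

    def __init__(self, hpt_capacity):
        if hpt_capacity < 1:
            raise ValueError("hpt_capacity must be at least 1")
        self.hpt_capacity = hpt_capacity
        self.lpt = {}             # LPN -> PPN
        self.ilpt = {}            # PPN -> set of LPNs sharing that page
        self.hpt = OrderedDict()  # hash -> PPN, most recently used last
        self.ihpt = {}            # PPN -> hash (stand-in for an HPT entry address)
        self.invalid = set()      # PPNs whose content is no longer referenced
        self.flash_writes = 0
        self._next_ppn = 0        # hypothetical append-only page allocator

    def write(self, lpn, data):
        digest = hashlib.md5(data).digest()  # 16B MD5, as in the traces
        ppn = self.hpt.get(digest)
        if ppn is not None:
            # HPT hit: existing value, no flash write; refresh LRU position.
            self.hpt.move_to_end(digest)
        else:
            # HPT miss: treated as a new value, so a flash page is written.
            ppn = self._next_ppn
            self._next_ppn += 1
            self.flash_writes += 1
            if len(self.hpt) >= self.hpt_capacity:
                # Evicted entries are simply discarded, never written back.
                _, evicted_ppn = self.hpt.popitem(last=False)
                self.ihpt.pop(evicted_ppn, None)
            self.hpt[digest] = ppn
            self.ihpt[ppn] = digest

        old_ppn = self.lpt.get(lpn)
        if old_ppn == ppn:
            return ppn  # same LPN rewriting the same content
        self.lpt[lpn] = ppn
        self.ilpt.setdefault(ppn, set()).add(lpn)
        if old_ppn is not None:
            self._release(old_ppn, lpn)
        return ppn

    def _release(self, ppn, lpn):
        """Drop lpn from the page's LPN list; invalidate the page if unused."""
        lpns = self.ilpt[ppn]
        lpns.discard(lpn)
        if not lpns:
            del self.ilpt[ppn]
            self.invalid.add(ppn)
            stale = self.ihpt.pop(ppn, None)  # may already have been evicted
            if stale is not None:
                self.hpt.pop(stale, None)


# Mirrors the example of Figure 5(b): (L3,V3) is new, (L2,V1) is a duplicate.
ftl = CAFTLSketch(hpt_capacity=4)
ftl.write(1, b"V1")
ftl.write(2, b"V2")
ftl.write(3, b"V3")
ftl.write(2, b"V1")
```

After the four requests only three flash pages have been written, L1 and L2 share one page, and the page that held V2 is invalid with its HPT/iHPT entries removed.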


4.2 Garbage Collection in CA-FTL 


Unlike conventional SSDs where all writes are propagated
to flash, CA-SSD only requires one write per
unique value ($W_{unique}$) except in the case of meta-data
cache misses (due to limited cache size) where some
duplicate values ($W_{dup}$) may also be written. Writes
may also be needed for values which have been invalidated/erased
(when no LPN points to them) and are reborn
($W_{reborn}$) due to subsequent write requests. Similar
to conventional SSDs, the final component is GC writes
($W_{GC}$) which depends on the number of GC invocations
as well as the number of valid pages copied upon each
such invocation. Therefore, the total writes for a CA-SSD
can be expressed as a sum of these components:
$W_{total} = W_{unique} + W_{dup} + W_{reborn} + W_{GC}$.

In traditional SSDs, every LPN update results in inval- 
idation of the PPN containing the previous LPN version. 
CA-SSD only invalidates pages when the value in them 
becomes dead in the sense of no LPN being associated 
with it any longer. Thus, garbage is likely to be generated
at a slower rate in CA-SSD. This, coupled with the
reduction in write traffic to flash due to de-duplication,
decreases the number of GC invocations for the same
GC policy as in a traditional SSD. The other aspect is the
number of pages copied during GC. As shown in Figure 6
for web, the valid content in the victim blocks is much
lower in CA-SSD as compared to that in a traditional SSD.
The average number of pages copied per block decreases
from 33.20 to 8.21, a reduction of about 75.27% with
CA-SSD. This is primarily due to data de-duplication
which reduces the amount of total valid content stored
on flash, in turn increasing the fraction of invalid pages
in victims. These observations lead us to conclude that
existing GC mechanisms should work well even in a CA-SSD.
We evaluate the impact of our choice in Section 5.

Figure 6: Cumulative distribution of valid pages in
blocks erased during GC in web workload.


5 Experimental Evaluation 


5.1 Experimental Setup 


We simulate both traditional SSDs and CA-SSDs using the SSD
simulator [10] which has been integrated into Disksim-4.0
[19]. The SSD simulator is capable of simulating
both SLC and MLC SSDs with multiple planes and dies.
As described in Section 2, we use SLC SSDs with extra
large pages (SLC2) and a single plane in this study (refer
to Table 1 for SSD properties). We have modified the
Disksim interface to use block-based traces with content
hashes. We have implemented the FTL for our CA-SSD
(CA-FTL) with the meta-data cache maintained using
LRU eviction based on TVL. We simulate the hashing
unit in CA-SSD by modeling the overheads (32μs [18])
of performing hash calculation along with their impact
on the queueing delays at the SSD controller. Note that
this is a conservative estimate and the hash calculation
overheads are likely to be much lower in CA-SSD (as
discussed in Section 2.2, SSDs with crypto-units have reported
similar performance to traditional SSDs [5]). As
explained earlier, we do not simulate read caching in either
the traditional SSD or CA-SSD.


5.2 Real-world Traces 


We first focus on the three real workload traces that were 
found to exhibit high VL in Section 3. Figure 7(a) shows 
the mean response time comparing the standard SSD 
with two CA-SSD configurations: (i) sufficient capac- 
ity in its RAM to store HPT/iHPT and (ii) capacity to 
store only a fixed number of hashes in RAM. For exam- 
ple, storing 128K hashes in HPT/iHPT requires 3.5MB. 
We also present mean response times for other meta-data 
cache configurations. We note the tremendous perfor- 
mance benefits obtained with our CA-SSD compared to 
the traditional SSD and the benefits directly correlate to 
the value locality/popularity in writes. For instance, the 
mail workload, which in Figure 2(b) demonstrates the 
highest VP of the three for writes, shows an 84% reduction
in response time with CA-SSD compared to the traditional
SSD. The reductions are substantial for homes
and web as well, which show 59% and 65% improve- 
ments in response times. 

In order to understand these benefits further, we break 
down the write traffic into those that are (a) directly im- 
posed by the workload and (b) additional writes imposed 
due to GC when valid pages need to be copied across 
blocks. The number of writes in each category is shown 
in Figure 7(b) for the traditional SSD and our CA-SSD. 
Overall, the reductions in write traffic for CA-SSD are
77%, 93% and 70% for web, mail and homes, respectively,
over a traditional SSD. We see significant reductions
in writes of both categories. The drop in category
(a) is intuitive to follow given the value popularity in
the workloads. Additionally, there is a significant reduction
in category (b) writes as well: 94%, 100% and 87%
for web, mail, and homes, respectively. In fact, in percentage
terms, these GC write reductions overshadow the
category (a) reductions. Note that GC overhead is a func- 
tion of the amount of garbage in the flash, and the distri- 
bution of this garbage across the blocks. Since a page 
in CA-SSD is treated as invalid only when all the LPNs 
having that content have written a “different value,” it 
is less likely to be marked as garbage compared to a 
traditional SSD where “any” (including the prior identi- 
cal) LPN write necessitates a page invalidation. Further- 
more, the decrease in the amount of valid content on the 
SSD due to de-duplication directly reduces pages copied 
during GC. All these reasons contribute to the substan- 
tial benefits that CA-SSD experiences in lower induced 
writes/copies compared to a traditional SSD. In fact, for 
the mail workload we observe no GC writes since the total
number of unique values seen for this workload fits
within the chosen SSD size without triggering GC.

Figure 7: Performance of CA-SSD vs traditional SSDs. (b) The reborn writes fraction is extremely low and hence
not shown. The bars for each workload should be read in the following order: NonCAS, CAS(infinite), CAS(16K),
CAS(128K), CAS(256K). Note that CAS(x) represents the meta-data cache size in terms of the number of hashes (x) it
can store. For response times, we also present the standard deviation, and observe that CA-SSD offers a reduction in the
variance in addition to the average.


Another important characteristic is the lifetime of 
SSD which depends on the write-erase cycles of blocks. 
Higher incoming write traffic results in higher block 
erases, reducing the useful lifetime of SSD. Write reduc- 
tion benefits from CA-SSD on both workload and GC 
writes directly translate into reduced block erases. As 
shown in Figure 7(c), the number of block erases in mail 
reduces from 47819 to 2876, a more than 15-fold decrease.
Similarly, homes and web experience 70% and 77% re- 
ductions in block erases, respectively. 


In Figure 7, we showed results for CA-SSD with both 
unlimited RAM capacity to store the HPT/iHPT, as well
as finite capacities of 16K, 128K, and 256K entries that 
require about 450KB, 3.5MB and 7MB of space respec- 
tively. Even for meta-data cache capacities less than 
1MB, CA-SSD shows significant improvements over tra- 
ditional SSD. For example, the mean response time for 
homes decreases by about 7ms (for 16K hashes) in CA- 
SSD as compared to traditional SSD whereas the block 
erases reduce by 65%. As we had seen in Section 3, mail 
shows lower TVL for writes and hence requires a larger 
meta-data cache to exploit CA-SSD benefits. However, 
we note that beyond 128K entries, we observe close to 
the infinite CA-SSD behavior for all workloads, reiter- 
ating the observations made in Section 3 regarding the 
ability to hold a substantial portion of the working set 
of the meta-data in these workloads within a relatively 
small space because of presence of TVL. 3.5MB of RAM 
is a relatively small amount of space to support in to- 
day’s SSDs - for instance, a 1TB [6] SSD has 512MB of 
DRAM which can be used for storing the meta-data. Re- 
gardless of the actual amount of available space to store 
this meta-data, CA-SSD can avail of whatever space is 
allocated to it, and as we will show in the next subsec- 
tion, even “complete absence of value locality” makes 
CA-SSD only slightly worse than a traditional SSD.

Figure 8: Impact of VL. The zipf parameter on the X-axis represents
the extent of VP skewness in the workload. A higher
zipf parameter indicates higher skewness in VP. The average
response times on the Y-axis are normalized with respect
to average response times for traditional SSDs. Note
that these response times are for unlimited cache-space.
We observe similar response times for a meta-data cache
which can store 128K hashes.


5.3 Impact of Value Locality on CA-SSD


We next conduct a more extensive analysis of the impact 
of value locality on CA-SSD performance to demonstrate 
that it is beneficial across a broad spectrum of work- 
load behaviors and not just for the three real workloads 
used above which exhibit good value locality. One dif- 
ficulty in considering a wide range of workloads is the 
lack of real workload traces for which content of each 
write is made available in the trace (most traces con- 
tain just the timestamp, address and size fields). On 
the other hand, considering a purely synthetic workload, 
Workload  | Description     | Size (GB) | Requests (in mill.) | % Writes
financial | OLTP            | 0.50      | 6.50                | 79.60
cello99   | HP-UX OS        | 0.46      | 0.44                | 70.79
proxy     | Proxy server    | 0.33      | 2.44                | 95.64
hm        | H/W Monitor     | 2.43      | 11.11               | 54.74
ts        | Terminal Server | 0.91      | 4.17                | 74.06
mds       | Media Server    | 3.09      | 2.89                | 70.46
src1      | Source Control  | 1.47      | 5.00                | 93.73

















Table 3: Workload description. Apart from the above 7 
workloads, we use mail, web, and homes that were de- 
scribed in Section 3. The workload size represents the 
total number of unique logical addresses(LPNs) accessed 
in the trace. The logical address space exposed to the file 
system can be much larger. 


may mandate assumptions on parameters - such as ar- 
rival rate, sequentiality, temporal locality, etc. - over and 
beyond those pertaining to value locality. Instead, we 
pick a set of 10 real workload traces (refer to Table 2
and Table 3) that have been studied in prior literature
- financial from UMass [8], cello99 from HP Labs [2],
proxy, hm, ts, mds and src1 from MSR [9], along with the
three workloads (homes, web and mail) from FIU [26].
We use the arrival times, block addresses and sizes from 
these traces, and only synthesize the “content” (v) for
the blocks using a zipf distribution, given as:
$P(v_i) = C v_i^{-a}$, where $C = 1 / \sum_{j=1}^{N} v_j^{-a}$,
N is the total unique
values in the workload and a is the zipf parameter representing
the skewness in value popularity. Many prior
studies [13] have shown content popularity can be char- 
acterized by this distribution. Furthermore, we vary 
the exponent (a) characterizing the distribution from 0
(which corresponds to no VP) to 1.0 (which corresponds 
to a very highly skewed VP behavior). In the experi- 
ments, we use this zipf probability distribution to pick a 
value for each incoming request. This exercises only the 
popularity of values and ignores the spatial and tempo- 
ral dimensions of value locality, and can thus be viewed 
as a pessimistic evaluation of CA-SSD since any spa- 
tial/temporal VL will only benefit it further (and not af- 
fect the performance of a traditional SSD which only re- 
lies on LPN-based spatial/temporal locality). Figure 8 
shows mean response times for these workloads on CA- 
SSD normalized with respect to traditional SSD response 
times. Similar to results in Section 5.2, as VP increases, 
the response times for these workloads decrease. Furthermore,
even when VP is low, the response times for
CA-SSD and traditional SSDs are comparable. We observe
that when the workloads show no VP (a=0.0), the
average response time of CA-SSD only increases by at
most 10% (for src1). This is primarily due to the overheads
of the hashing unit for write requests which we
have chosen conservatively. Thus, we expect the average 
response time to be lower with a more aggressive esti- 
mate (If needed, one could even explore the possibility of 
dynamically turning off CAS in CA-SSD in complete ab- 
sence of VL). On the other hand, for high VP (a = 1.0), 
we see tremendous benefits with CA-SSD. We observe 
around a 25-fold reduction in average response times for the
financial trace, and on average all workloads show an improvement
of about 74%. Furthermore, the number of
values which account for 50% of the write requests in 
hm workload decreases from 4.5M for no VP (a = 0.0) 
to 1.3M for moderate VP (a = 0.4), a reduction of ap- 
proximately 71%. This clearly illustrates that the ben- 
efits accrued through VP specifically and value locality 
in general, strengthen the case for adoption of content 
addressability in SSDs, paving the way for a new gener- 
ation of SSDs. 
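
The content-synthesis step described in this section can be sketched as follows. This is an illustration under assumptions, not the authors' generator: it takes $v_i$ to be the popularity rank i of a value, and the function names and fixed seed are hypothetical.

```python
import random
from itertools import accumulate


def zipf_pmf(n, a):
    """P(rank i) = C * i**(-a) for i = 1..n, with C normalizing the sum to 1."""
    weights = [rank ** -a for rank in range(1, n + 1)]
    c = 1.0 / sum(weights)
    return [c * w for w in weights]


def make_value_sampler(n, a, seed=0):
    """Return a function that draws a value id in [0, n) for each request."""
    rng = random.Random(seed)
    cumulative = list(accumulate(zipf_pmf(n, a)))
    value_ids = range(n)

    def sample():
        return rng.choices(value_ids, cum_weights=cumulative, k=1)[0]

    return sample


uniform = zipf_pmf(4, 0.0)   # a = 0: no value popularity
skewed = zipf_pmf(3, 1.0)    # a = 1: highly skewed popularity
sampler = make_value_sampler(1000, 0.4)
drawn = [sampler() for _ in range(10)]
```

With a = 0 every value is equally likely, which corresponds to the no-VP case in Figure 8, while larger exponents concentrate requests on a few low-rank values.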


6 Related Work 


Value Locality/Content Addressability: CAS has 
been extensively used in archival and backup systems
such as Venti [42], Foundation [43], Pastiche [14], etc. for
space savings, Internet suspend/resume [27], LBFS [35]
for saving network bandwidth, file system and buffer
cache design [35, 47, 34], etc. Some recent work has
evaluated real-world workloads and demonstrated signif- 
icant value locality which bodes well for CA-SSD [26, 
36]. However, to the best of our knowledge, this paper is 
the first to focus on issues that arise when designing an 
SSD that uses CAS internally. 


Meta-data Management for CAS: The scalability of 
a system employing CAS depends on careful manage- 
ment of CAS related meta-data. Larger-sized chunks 
help in reducing the amount of meta-data to be stored 
while smaller chunks provide good duplicate elimination.
Pasta [33], Pastiche [14], REBL [29] and Foundation
[43] have explored more complex chunking methods.
Bimodal chunking attempts to combine the benefits
of two different chunk sizes [28]. CA-SSD could benefit
from all of these techniques, and evaluating the benefits
of different/variable chunk sizes is part of our future
work. Sparse indexing divides the incoming data stream 
into large segments which are then de-duplicated against 
a few similar segments found using sampling [31]. Like 
sparse indexing, the degree of de-duplication in CA-SSD 
depends on the available meta-data cache space. Re- 
searchers have developed CAS meta-data management 
techniques which utilize HDD/SSDs for storing chunk 
indexes [48, 15]. These techniques utilize spatial lo- 
cality in data segments for reducing index lookups by 
pre-fetching meta-data in RAM. Unlike these techniques,
CA-SSD does away with index lookups on HDD/SSD 
and utilizes TVL for reducing meta-data misses. 


7 Conclusion 


Given evidence for the presence of significant VL in real- 
world workloads, we designed CA-SSD which employed 
CAS for its internal data management. Using three real- 
world workloads with content information, we devised 
statistical characterizations of two aspects of VL - value 
popularity and temporal VL - that formed the foundation 
of CA-SSD. The design of CA-SSD presented us with 
interesting choices and challenges related to exploiting 
VL for write reduction and maintaining meta-data con- 
sistency under constrained cache space. Using several 
real-world workloads, we conducted an extensive eval- 
uation of CA-SSD. We found significant improvements 
(59-84%) in average response times. Even for workloads 
with little or no value locality, we observed that CA-SSD 
continued to offer comparable performance to a tradi- 
tional SSD. 
Abstract


Reliably erasing data from storage media (sanitizing the 
media) is a critical component of secure data manage- 
ment. While sanitizing entire disks and individual files is 
well-understood for hard drives, flash-based solid state 
disks have a very different internal architecture, so it 
is unclear whether hard drive techniques will work for 
SSDs as well. 

We empirically evaluate the effectiveness of hard 
drive-oriented techniques and of the SSDs’ built-in san- 
itization commands by extracting raw data from the 
SSD’s flash chips after applying these techniques and 
commands. Our results lead to three conclusions: 
First, built-in commands are effective, but manufactur- 
ers sometimes implement them incorrectly. Second, 
overwriting the entire visible address space of an SSD 
twice is usually, but not always, sufficient to sanitize the 
drive. Third, none of the existing hard drive-oriented 
techniques for individual file sanitization are effective on 
SSDs. 

This third conclusion leads us to develop flash trans- 
lation layer extensions that exploit the details of flash 
memory’s behavior to efficiently support file sanitization. 
Overall, we find that reliable SSD sanitization requires 
built-in, verifiable sanitize operations. 


1 Introduction 


As users, corporations, and government agencies store 
more data in digital media, managing that data and access 
to it becomes increasingly important. Reliably remov- 
ing data from persistent storage is an essential aspect of 
this management process, and several techniques that re- 
liably delete data from hard disks are available as built-in 
ATA or SCSI commands, software tools, and government 
standards. 

These techniques provide effective means of sanitiz- 
ing hard disk drives (HDDs) — either individual files they 
store or the drive in their entirety. Software methods typically
involve overwriting all or part of the drive multiple
times with patterns specifically designed to obscure any
remnant data. The ATA and SCSI command sets include 
“secure erase” commands that should sanitize an entire 
disk. Physical destruction and degaussing are also effec- 
tive. 

Flash-based solid-state drives (SSDs) differ from hard 
drives in both the technology they use to store data (flash 
chips vs. magnetic disks) and the algorithms they use 
to manage and access that data. SSDs maintain a layer 
of indirection between the logical block addresses that 
computer systems use to access data and the raw flash 
addresses that identify physical storage. The layer of in- 
direction enhances SSD performance and reliability by 
hiding flash memory’s idiosyncratic interface and man- 
aging its limited lifetime, but it can also produce copies 
of the data that are invisible to the user but that a sophis- 
ticated attacker can recover. 

The differences between SSDs and hard drives make it 
uncertain whether techniques and commands developed 
for hard drives will be effective on SSDs. We have developed
a procedure to determine whether a sanitization technique
is effective on an SSD: We write a structured
data pattern to the drive, apply the sanitization technique, 
dismantle the drive, and extract the raw data directly 
from the flash chips using a custom flash testing system. 

We tested ATA commands for sanitizing an entire 
SSD, software techniques to do the same, and software 
techniques for sanitizing individual files. We find that 
while most implementations of the ATA commands are 
correct, others contain serious bugs that can, in some 
cases, result in all the data remaining intact on the drive. 
Our data shows software-based full-disk techniques are 
usually, but not always, effective, and we have found evi- 
dence that the data pattern used may impact the effective- 
ness of overwriting. Single-file sanitization techniques, 
however, consistently fail to remove data from the SSD. 

Enabling single-file sanitization requires changes to 
the flash translation layer that manages the mapping between
logical and physical addresses. We have developed
three mechanisms to support single-file sanitization
and implemented them in a simulated SSD. The mecha- 
nisms rely on a detailed understanding of flash memory’s 
behavior beyond what datasheets typically supply. The 
techniques can either sacrifice a small amount of perfor- 
mance for continuous sanitization or they can preserve 
common case performance and support sanitization on 
demand. 

We conclude that the complexity of SSDs relative to 
hard drives requires that they provide built-in sanitiza- 
tion commands. Our tests show that since manufacturers 
do not always implement these commands correctly, the 
commands should be verifiable as well. Current and pro- 
posed ATA and SCSI standards provide no mechanism 
for verification and the current trend toward encrypting 
SSDs makes verification even harder. 

The remainder of this paper is organized as follows: 
Section 2 describes the sanitization problem in detail. 
Section 3 presents our verification methodology and re- 
sults for existing hard disk-oriented techniques. Sec- 
tion 4 describes our FTL extensions to support single-file 
sanitization, and Section 5 presents our conclusions. 


2 Sanitizing SSDs 


The ability to reliably erase data from a storage device 
is critical to maintaining the security of that data. This 
paper identifies and develops effective methods for eras- 
ing data from solid-state drives (SSDs). Before we can 
address these goals, however, we must understand what 
it means to sanitize storage. This section establishes 
that definition while briefly describing techniques used 
to erase hard drives. Then, it explains why those tech- 
niques may not apply to SSDs. 


2.1 Defining “sanitized” 


In this work, we use the term “sanitize” to describe the 
process of erasing all or part of a storage device so that 
the data it contained is difficult or impossible to recover. 
Below we describe five different levels of sanitization 
storage can undergo. We will use these terms to catego- 
rize and evaluate the sanitization techniques in Sections 3 
and 4. 

The first level is logical sanitization. Data in log- 
ically sanitized storage is not recoverable via standard 
hardware interfaces such as standard ATA or SCSI com- 
mands. Users can logically sanitize an entire hard drive 
or an individual file by overwriting all or part of the 
drive, respectively. Logical sanitization corresponds 
to “clearing” as defined in NIST 800-88 [25], one of 
several documents from governments around the world 
[11, 26, 9, 13, 17, 10] that provide guidance for data de- 
struction. 

The next level is digital sanitization. It is not possible 
to recover data from digitally sanitized storage via any 
digital means, including undocumented drive commands 
or subversion of the device’s controller or firmware. On 
disks, overwriting and then deleting a file suffices for 
both logical and digital sanitization with the caveat that 
overwriting may not digitally sanitize bad blocks that the 
drive has retired from use. As we shall see, the complex- 
ity of SSDs makes digitally sanitizing them more com- 
plicated. 

The next level of sanitization is analog sanitization. 
Analog sanitization degrades the analog signal that en- 
codes the data so that reconstructing the signal is effec- 
tively impossible even with the most advanced sensing 
equipment and expertise. NIST 800-88 refers to analog 
sanitization as “purging.” 

An alternative approach to overwriting or otherwise 
obliterating bits is to cryptographically sanitize storage. 
Here, the drive uses a cryptographic key to encrypt and 
decrypt incoming and outgoing data. To sanitize the 
drive, the user issues a command to sanitize the storage 
that holds the key. The effectiveness of cryptographic 
sanitization relies on the security of the encryption sys- 
tem used (e.g., AES [24]), and upon the designer’s abil- 
ity to eliminate “side channel” attacks that might allow 
an adversary to extract the key or otherwise bypass the 
encryption. 

The correct choice of sanitization level for a partic- 
ular application depends on the sensitivity of the data 
and the means and expertise of the expected adversary. 
Many government standards [11, 26, 9, 13, 17, 10] and 
secure erase software programs use multiple overwrites 
to erase data on hard drives. As a result many individuals 
and companies rely on software-based overwrite tech- 
niques for disposing of data. To our knowledge (based 
on working closely with several government agencies), 
no one has ever publicly demonstrated bulk recovery of 
data from an HDD after such erasure, so this confidence 
is probably well-placed.¹


2.2 SSD challenges 


The internals of an SSD differ in almost every respect 
from a hard drive, so assuming that the erasure tech- 
niques that work for hard drives will also work for SSDs 
is dangerous. 

SSDs use flash memory to store data. Flash memory is 
divided into pages and blocks. Program operations apply
to pages and can only change 1s to 0s. Erase operations
apply to blocks and set all the bits in a block to 1. As a
result, in-place update is not possible. There are typically
64-256 pages in a block (see Table 5). 
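
The program and erase rules above can be illustrated with a minimal sketch. The page and block sizes here are toy values chosen for readability, and the class name `FlashBlock` is hypothetical; the sketch models only the bit-level rule that programming clears bits and erasing sets them.

```python
PAGE_BYTES = 4        # toy page size (assumption); real pages are kilobytes
PAGES_PER_BLOCK = 4   # toy value (assumption); the text cites 64-256


class FlashBlock:
    """Toy model of one flash block: program clears bits, erase sets them."""

    def __init__(self):
        # An erased block holds all 1 bits.
        self.pages = [bytes([0xFF] * PAGE_BYTES) for _ in range(PAGES_PER_BLOCK)]

    def program(self, page_no, data):
        # Programming can only change 1s to 0s, so the stored value is
        # the bitwise AND of the old contents and the new data.
        old = self.pages[page_no]
        self.pages[page_no] = bytes(o & d for o, d in zip(old, data))
        return self.pages[page_no]

    def erase(self):
        # Erase applies to the whole block and sets every bit to 1.
        self.pages = [bytes([0xFF] * PAGE_BYTES) for _ in range(PAGES_PER_BLOCK)]


blk = FlashBlock()
first = blk.program(0, bytes([0xF0] * PAGE_BYTES))
# Attempting an in-place update does not store the new value:
second = blk.program(0, bytes([0x0F] * PAGE_BYTES))
```

In this sketch the second program operation leaves the page holding neither the old nor the new value, which is why a real device must erase the entire block before a page can be rewritten.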

A flash translation layer (FTL) [15] manages the map- 
ping between logical block addresses (LBAs) that are 
visible via the ATA or SCSI interface and physical pages
of flash memory. Because of the mismatch in granularity
between erase operations and program operations in
flash, in-place update of the sector at an LBA is not possible.


¹Of course, there may have been non-public demonstrations.

Instead, to modify a sector, the FTL will write the new 
contents for the sector to another location and update the 
map so that the new data appears at the target LBA. As a 
result, the old version of the data remains in digital form 
in the flash memory. We refer to these “left over” data as 
digital remnants. 
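
A minimal sketch of how out-of-place updates produce digital remnants follows. The class `ToyFTL` and its methods are hypothetical and model only an append-only page allocator with a logical-to-physical map; garbage collection, wear leveling, and block erasure are deliberately not modeled.

```python
class ToyFTL:
    """Toy flash translation layer: every write goes to a fresh physical page."""

    def __init__(self, num_physical_pages):
        self.flash = [None] * num_physical_pages  # None means erased
        self.map = {}                             # LBA -> physical page index
        self.next_free = 0

    def write(self, lba, data):
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free pages (garbage collection not modeled)")
        # Write out of place, then redirect the LBA to the new page.
        self.flash[self.next_free] = (lba, data)
        self.map[lba] = self.next_free
        self.next_free += 1

    def read(self, lba):
        # What the host sees through the ATA or SCSI interface.
        return self.flash[self.map[lba]][1]

    def remnants(self, lba):
        # What an attacker reading the raw flash would also find.
        live = self.map.get(lba)
        stale = []
        for index, page in enumerate(self.flash):
            if page is not None and page[0] == lba and index != live:
                stale.append(page[1])
        return stale


ftl = ToyFTL(8)
ftl.write(5, b"secret")
ftl.write(5, b"zeros!")   # first "overwrite" of the same LBA
ftl.write(5, b"ones!!")   # second "overwrite"
visible = ftl.read(5)
stale = ftl.remnants(5)
```

Here the host reads only the latest contents of LBA 5, while both earlier versions, including the original data, remain in physical pages that the interface no longer exposes.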

Since in-place updates are not possible in SSDs, the 
overwrite-based erasure techniques that work well for 
hard drives may not work properly for SSDs. Those 
techniques assume that overwriting a portion of the LBA 
space results in overwriting the same physical media that 
stored the original data. Overwriting data on an SSD re- 
sults in logical sanitization (i.e., the data is not retrievable 
via the SATA or SCSI interface) but not digital sanitiza- 
tion. 

Analog sanitization is more complex for SSDs than for 
hard drives as well. Gutmann [20, 19] examines the prob- 
lem of data remnants in flash, DRAM, SRAM, and EEP- 
ROM, and recently, so-called “cold boot” attacks [21] re- 
covered data from powered-down DRAM devices. The 
analysis in these papers suggests that verifying analog 
sanitization in memories is challenging because there are 
many mechanisms that can imprint remnant data on the 
devices. 

The simplest of these is that the voltage level on an 
erased flash cell’s floating gate may vary depending on 
the value it held before the erase command. Multi-level 
cell devices (MLC), which store more than one bit per 
floating gate, already provide stringent control of the voltage
in an erased cell, and our conversations with industry
[1] suggest that a single erasure may be sufficient. For 
devices that store a single bit per cell (SLC) a single era- 
sure may not suffice. We do not address analog erasure 
further in this work. 

The quantity of digital remnant data in an SSD can be 
quite large. The SSDs we tested contain between 6 and 
25% more physical flash storage than they advertise as 
their logical capacity. Figure 1 demonstrates the exis- 
tence of the remnants in an SSD. We created 1000 small 
files on an SSD, dismantled the drive, and searched for 
the files’ contents. The SSD contained up to 16 stale 
copies of some of the files. The FTL created the copies 
during garbage collection and out-of-place updates. 

Complicating matters further, many drives encrypt 
data and some appear to compress data as well to im- 
prove write performance: one of our drives rumored to 
use compression is 25% faster for writes of highly com- 
pressible data than incompressible data. This adds an 
additional level of complexity not present in hard drives. 

Figure 1: Multiple copies. This graph shows the FTL
duplicating files up to 16 times. The graph exhibits a
spiking pattern which is probably due to the page-level
management by the FTL.

Figure 2: Ming the Merciless. Our custom FPGA-based
flash testing hardware provides direct access to flash
chips without interference from an FTL.

Unless the drive is encrypted, recovering remnant data
from the flash is not difficult. Figure 2 shows the FPGA-
based hardware we built to extract remnants. It cost 
$1000 to build, but a simpler, microcontroller-based ver- 
sion would cost as little as $200, and would require only 
a moderate amount of technical skill to construct. 


These differences between hard drives and SSDs po- 
tentially lead to a dangerous disconnect between user 
expectations and the drive’s actual behavior: An SSD’s 
owner might apply a hard drive-centric sanitization tech- 
nique under the misguided belief that it will render the 
data essentially irrecoverable. In truth, data may remain 
on the drive and require only moderate sophistication to 
extract. The next section quantifies this risk by applying 
commonly-used hard drive-oriented techniques to SSDs 
and attempting to recover the “deleted” data. 

[Figure 3 diagram: a 512-byte ATA sector holds five 88-byte
fingerprints (Fingerprint 0 through Fingerprint 4) followed by
72 bytes of padding. Labels on the 88-byte fingerprint include
an 8-byte “Magic” header, a generation number, an iteration
number, and a bit pattern.]

Figure 3: Fingerprint structure. The easily-identified
fingerprint simplifies the task of identifying and recon-
structing remnant data.

3 Existing techniques 


This section describes our procedure for testing sanitiza- 
tion techniques and then uses it to determine how well 
hard drive sanitization techniques work for SSDs. We 
consider both sanitizing an entire drive at once and se- 
lectively sanitizing individual files. Then we briefly dis- 
cuss our findings in relation to government standards for 
sanitizing flash media. 


3.1 Validation methodology 


Our method for verifying digital sanitization operations 
uses the lowest-level digital interface to the data in an 
SSD: the pins of the individual flash chips. 

To verify a sanitization operation, we write an iden- 
tifiable data pattern called a fingerprint (Figure 3) to the 
SSD and then apply the sanitization technique under test. 
The fingerprint makes it easy to identify remnant digi- 
tal data on the flash chips. It includes a sequence num- 
ber that is unique across all fingerprints, byte patterns to 
help in identifying and reassembling fingerprints, and a 
checksum. It also includes an identifier that we use to 
identify different sets of fingerprints. For instance, all 
the fingerprints written as part of one overwrite pass or 
to a particular file will have the same identifier. Each 
fingerprint is 88 bytes long and repeats five times in a 
512-byte ATA sector. 
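
The sector layout described above (five 88-byte fingerprints plus 72 bytes of padding in a 512-byte sector) can be sketched as follows. The field order, field widths other than the 8-byte header, the header value `FPRINT00`, and the use of CRC-32 as the checksum are all assumptions made for illustration; the text does not specify them.

```python
import struct
import zlib

MAGIC = b"FPRINT00"        # hypothetical 8-byte header value
FINGERPRINT_BYTES = 88
COPIES_PER_SECTOR = 5
SECTOR_BYTES = 512


def make_fingerprint(set_id, sequence):
    """Build one 88-byte fingerprint (assumed layout, see lead-in)."""
    # 64 bytes of a recognizable byte pattern derived from the sequence number.
    pattern = bytes((sequence + i) & 0xFF for i in range(64))
    # 8-byte header + 4-byte set identifier + 8-byte sequence number + pattern.
    body = struct.pack("<8sIQ64s", MAGIC, set_id, sequence, pattern)
    # 4-byte checksum over everything before it (CRC-32 is an assumption).
    return body + struct.pack("<I", zlib.crc32(body))


def make_sector(set_id, sequence):
    """Repeat the fingerprint five times and pad to a full ATA sector."""
    fingerprint = make_fingerprint(set_id, sequence)
    padding = bytes(SECTOR_BYTES - COPIES_PER_SECTOR * FINGERPRINT_BYTES)
    return fingerprint * COPIES_PER_SECTOR + padding


sector = make_sector(set_id=3, sequence=1234)
```

Repeating the fingerprint within the sector means that a partially recovered sector can still yield at least one intact copy with a valid checksum.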

Once we have applied the fingerprint and sanitized the 
drive, we dismantle it. We use the flash testing system 
in Figure 2 to extract raw data from its flash chips. The 
testing system uses an FPGA running a Linux software 
stack to provide direct access to the flash chips. 

Finally, we assemble the fingerprints and analyze them 
to determine if the sanitization was successful. SSDs 
vary in how they spread and store data across flash chips: 
some interleave bytes between chips (e.g., odd bytes on 
one chip and even bytes on another) and others invert 
data before writing. The fingerprint’s regularity makes 
it easy to identify and reassemble them, despite these 
complications. Counting the number of fingerprints that 
remain and categorizing them by their IDs allows us to
measure the sanitization’s effectiveness.


SSD   Ctlr # & Type   SECURITY ERASE UNIT   SEC. ERASE UNIT ENH
A     1-MLC           Not Supported         Not Supported
B     2-SLC           Failed*               Not Supported
C     1-MLC           Failed†               Not Supported
D     3-MLC           Failed†               Not Supported
E     4-MLC           Encrypted‡            Encrypted‡
F     5-MLC           Success               Success
G     6-MLC           Success               Success
H     7-MLC           Success               Success
I     8-MLC           Success               Success
J§    9-TLC           Not Supported         Not Supported
K§    10-MLC          Not Supported         Not Supported
L§    11-MLC          Not Supported         Not Supported

*Drive reported success but all data remained on drive
†Sanitization only successful under certain conditions
‡Drive encrypted, unable to verify if keys were deleted
§USB mass storage device does not support ATA security [30]


Table 1: Built-in ATA sanitize commands. Support for
built-in ATA security commands varied among drives,
and three of the drives tested did not properly execute
a sanitize command they reported to support.
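
The fingerprint counting step of Section 3.1 can be sketched as a scan over a raw flash dump. This is an illustration only: the 88-byte layout, the header value, and the CRC-32 checksum are assumptions (the text does not give them), only whole-image bit inversion is handled, and byte interleaving across chips is not modeled.

```python
import struct
import zlib

MAGIC = b"FPRINT00"   # hypothetical 8-byte header value
FP_BYTES = 88


def build(set_id, sequence):
    """Build a fingerprint in the assumed layout (header, ID, sequence, pattern, CRC)."""
    body = struct.pack("<8sIQ64s", MAGIC, set_id, sequence, bytes(64))
    return body + struct.pack("<I", zlib.crc32(body))


def scan(dump):
    """Return {set_id: set of sequence numbers} for valid fingerprints in a dump."""
    found = {}
    inverted = bytes(b ^ 0xFF for b in dump)  # some drives invert data before writing
    for image in (dump, inverted):
        pos = image.find(MAGIC)
        while pos != -1:
            chunk = image[pos:pos + FP_BYTES]
            if len(chunk) == FP_BYTES:
                body = chunk[:-4]
                (crc,) = struct.unpack("<I", chunk[-4:])
                if zlib.crc32(body) == crc:  # discard damaged candidates
                    _, set_id, seq, _ = struct.unpack("<8sIQ64s", body)
                    found.setdefault(set_id, set()).add(seq)
            pos = image.find(MAGIC, pos + 1)
    return found


# A toy dump: one plain fingerprint, one stored inverted, one duplicate copy.
raw = (b"\x00" * 10 + build(1, 7) + b"\xAA" * 5
       + bytes(b ^ 0xFF for b in build(2, 9)) + build(1, 7))
remaining = scan(raw)
counts = {set_id: len(seqs) for set_id, seqs in remaining.items()}
```

Grouping the surviving sequence numbers by identifier gives a per-pass or per-file count of unique fingerprints that were not erased; duplicate physical copies of the same fingerprint collapse into one entry.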


3.2 Whole-drive sanitization 


We evaluate three different techniques for sanitizing an 
entire SSD: issuing a built-in sanitize command, repeat- 
edly writing over the drive using normal IO operations, 
and degaussing the drive. Then we briefly discuss lever- 
aging encryption to sanitize SSDs. 


3.2.1 Built-in sanitize commands 


Most modern drives have built-in sanitize commands that 
instruct on-board firmware to run a sanitization proto- 
col on the drive. Since the manufacturer has full knowl- 
edge of the drive’s design, these techniques should be 
very reliable. However, implementing these commands 
is optional in the drive specification standards. For in- 
stance, removable USB drives do not support them as 
they are not supported under the USB Mass Storage De- 
vice class [30]. 

The ATA security command set specifies an “ERASE 
UNIT” command that erases all user-accessible areas on 
the drive by writing all binary zeros or ones [3]. There is 
also an enhanced “ERASE UNIT ENH” command that 
writes a vendor-defined pattern (presumably because the 
vendor knows the best pattern to eliminate analog rem- 
nants). The new ACS-2 specification [4], which is still 
in draft at the time of this writing, specifies a “BLOCK 
ERASE” command that is part of its SANITIZE feature 
set. It instructs a drive to perform a block erase on all 
memory blocks containing user data even if they are not 
user-accessible. 
We collected 12 different SSDs and determined if they 
supported the security and sanitize feature sets. If the 
SSD supported the command, we verified effectiveness 
by writing a fingerprint to the entire drive several times 
and then issuing the command. Overwriting several 
times fills as much of the over-provision area as possi- 
ble with fingerprint data. 

Support and implementation of the built-in commands 
varied across vendors and firmware revisions (Table 1). 
Of the 12 drives we tested, none supported the ACS-2 
“SANITIZE BLOCK ERASE” command. This is not 
surprising, since the standard is not yet final. Eight of the 
drives reported that they supported the ATA SECURITY 
feature set. One of these encrypts data, so we could not 
verify if the sanitization was successful. Of the remain- 
ing seven, only four executed the “ERASE UNIT” com- 
mand reliably. 

Drive B’s behavior is the most disturbing: it reported 
that sanitization was successful, but all the data remained 
intact. In fact, the filesystem was still mountable. Two 
more drives suffered from a bug that prevented the ERASE
UNIT command from working unless the drive firmware
had recently been reset; otherwise, the command would only
erase the first LBA. However, they accurately reported
that the command failed. 

The wide variance among the drives leads us to con- 
clude that each implementation of the security com- 
mands must be individually tested before it can be trusted 
to properly sanitize the drive. 

In addition to the standard commands, several drive 
manufacturers also provide special utilities that issue 
non-standard erasure commands. We did not test these 
commands, but we expect that results would be similar 
to those for the ATA commands: most would work cor- 
rectly but some may be buggy. Regardless, we feel these 
non-standard commands are of limited use: the typical 
user may not know which model of SSD they own, let 
alone have the wherewithal to download specialized util- 
ities for them. In addition, the usefulness of the utility 
depends on the manufacturer keeping it up-to-date and 
available online. Standardized commands should work 
correctly almost indefinitely. 


3.2.2 Overwrite techniques 


The second sanitization method is to use normal IO com- 
mands to overwrite each logical block address on the 
drive. Repeated software overwrite is at the heart of 
many disk sanitization standards [11, 26, 9, 13, 17, 10] 
and tools [23, 8, 16, 5]. All of the standards and tools 
we have examined use a similar approach: They sequentially
overwrite the entire drive with between 1 and 35 bit
patterns. The US Air Force System Instruction 5020 [2] 
is typical: It first fills the drive with binary zeros, then 
binary ones, and finally an arbitrary character. The data
is then read back to confirm that only the character is
present.


        Seq. 20 Pass                 Rand. 20 Pass
SSD     Init: Seq.   Init: Rand.     Init: Seq.   Init: Rand.
A       >20          N/A*            N/A*         N/A*
B       1            N/A*            N/A*         N/A*
C       2            2               2            2
D       2            2               N/A*         N/A*
F       2            121 hr.†        121 hr.†     121 hr.†
J       2            70 hr.†         70 hr.†      70 hr.†
K       2            140 hr.†        140 hr.†     140 hr.†
L       2            58 hr.†         58 hr.†      58 hr.†

*Insufficient drives to perform test
†Test took too long to perform, time for single pass indicated.


Table 2: Whole-disk software overwrite. The number
in each column indicates the number of passes needed to
erase data on the drive. Drives G through I encrypt, so
we could not conclude anything about the success of the
techniques.

The varied bit patterns aim to switch as many of the 
physical bits on the drive as possible and, therefore, make 
it more difficult to recover the data via analog means. 

Bit patterns are potentially important for SSDs as well, 
but for different reasons. Since some SSDs compress 
data before storing, they will write fewer bits to the flash 
if the data is highly compressible. This suggests that 
for maximum effectiveness, SSD overwrite procedures 
should use random data. However, only one of the drives 
we tested (Drive G) appeared to use compression, and 
since it also encrypts data we could not verify sanitiza- 
tion. 

Since our focus is on digital erasure, the bit patterns 
are not relevant for drives that store unencrypted, un- 
compressed data. This means we can evaluate overwrite 
techniques in general by simply overwriting a drive with 
many generations of fingerprints, extracting its contents, 
and counting the number of generations still present on 
the drive. If k generations remain, and the first genera- 
tion is completely erased, then k passes are sufficient to 
erase the drive. 
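
The rule in the preceding paragraph can be written down directly. The function name and the convention that generation 1 is the initial fill are assumptions made for this sketch; the input is the set of generation numbers whose fingerprints were recovered from the flash chips.

```python
def passes_needed(recovered_generations):
    """Apply the rule above to the generation numbers found in the raw flash.

    Generation 1 is taken to be the first data written (assumption).
    Returns k if k generations remain and generation 1 is fully erased,
    or None if generation 1 survived, meaning the passes tested did not suffice.
    """
    remaining = set(recovered_generations)
    if 1 in remaining:
        return None
    return len(remaining)


# 21 generations written (initial fill plus 20 overwrites), two still present:
typical = passes_needed([20, 21, 21, 20])
# Fingerprints from the first generation still present after all passes:
failed = passes_needed([1, 19, 20, 21])
```

A `None` result corresponds to the ">20" entry in Table 2, where data from the initial fill survived every overwrite pass that was tested.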

The complexity of SSD FTLs means that the usage 
history before the overwrite passes may impact the ef- 
fectiveness of the technique. To account for this, we pre- 
pared SSDs by writing the first pass of data either se- 
quentially or randomly. Then, we performed 20 sequen- 
tial overwrites. For the random writes, we wrote every 
LBA exactly once, but in a pseudo-random order. 
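
One simple way to write every LBA exactly once in a pseudo-random order is to shuffle the full list of addresses. This is a sketch under the assumption that the list fits in memory; the function name and the seed parameter are illustrative and not taken from the test harness used in the experiments.

```python
import random


def random_write_order(num_lbas, seed=0):
    """Return a pseudo-random permutation of all LBAs in [0, num_lbas)."""
    order = list(range(num_lbas))
    # A seeded generator makes the order reproducible across runs.
    random.Random(seed).shuffle(order)
    return order


order = random_write_order(16, seed=42)
```

Because the result is a permutation, each pass still overwrites the whole logical address space, so a random pass and a sequential pass write the same amount of data and differ only in ordering.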

Table 2 shows the results for the eight non-encrypting 
drives we tested. The numbers indicate how many gen- 
erations of data were necessary to erase the drive. For 
some drives, random writes were prohibitively slow, tak- 
ing as long as 121 hours for a single pass, so we do not 
perform the random write test on these drives. In most 
cases, overwriting the entire disk twice was sufficient to 
sanitize the disk, regardless of the previous state of the 
drive. There were three exceptions: about 1% (1 GB) 
of the data remained on Drive A after twenty passes. We 
also tested a commercial implementation of the four-pass 
5220.22-M standard [12] on Drive C. For the sequential 
initialization case, it removed all the data, but with ran- 
dom initialization, a single fingerprint remained. Since 
our testing procedure destroys the drive, we did not per- 
form some test combinations. 

Overall, the results for overwriting are poor: while 
overwriting appears to be effective in some cases across a 
wide range of drives, it is clearly not universally reliable. 
It seems unlikely that an individual or organization ex- 
pending the effort to sanitize a device would be satisfied 
with this level of performance. 


3.2.3 Degaussing 


We also evaluated degaussing as a method for erasing 
SSDs. Degaussing is a fast, effective means of destroying
hard drives, since it removes the disk’s low-level formatting
(along with all the data) and damages the drive
motor. The mechanism flash memories use to store data 
is not magnetism-based, so we did not expect the de- 
gausser to erase the flash cells directly. However, the 
strong alternating magnetic fields that the degausser produces
will induce powerful eddy currents in the chip’s metal
layers. These currents may damage the chips, leaving 
them unreadable. 

We degaussed individual flash chips written with our 
fingerprint rather than entire SSDs. We used seven chips 
(marked with * in Table 5) that covered SLC, MLC and 
TLC (triple-level cell) devices across a range of process 
generation feature sizes. The degausser was a Security, 
Inc. HD-3D hard drive degausser that has been evalu- 
ated for the NSA and can thoroughly sanitize modern 
hard drives. It degaussed the chips by applying a rotating 
14,000 gauss field co-planar to the chips and an 8,000 
gauss perpendicular alternating field. In all cases, the 
data remained intact. 


3.2.4 Encryption 


Many recently-introduced SSDs encrypt data by default, 
because it provides increased security. It also provides a 
quick means to sanitize the device, since deleting the en- 
cryption key will, in theory, render the data on the drive 
irretrievable. Drive E takes this approach. 

The advantage of this approach is that it is very fast: 
The sanitization command takes less than a second for 
Drive E. The danger, however, is that it relies on the con- 
troller to properly sanitize the internal storage location 
that holds the encryption key and any other derived val- 
ues that might be useful in cryptanalysis. Given the bugs 
we found in some implementations of secure erase com- 
mands, it is unduly optimistic to assume that SSD ven- 
dors will properly sanitize the key store. Further, there is 
no way to verify that erasure has occurred (e.g., by dismantling
the drive).

A hybrid approach called SAFE [29] can provide both 
speed and verifiability. SAFE sanitizes the key store and 
then performs an erase on each block in a flash storage 
array. When the erase is finished, the drive enters a “verifiable”
state. In this state, it is possible to dismantle the
drive and verify that the erasure portion of the sanitiza- 
tion process was successful. 


3.3 Single-file sanitization 


Sanitizing single files while leaving the rest of the data 
in the drive intact is important for maintaining data se- 
curity in drives that are still in use. For instance, users 
may wish to destroy data such as encryption keys, finan- 
cial records, or legal documents when they are no longer 
needed. Furthermore, for systems such as personal computers
and cell phones, where the operating system, programs,
and user data all reside on the same SSD, sanitizing
single files is the only sanitization option that will
leave the system in a usable state. 

Erasing a file is a more delicate operation than eras- 
ing the entire drive. It requires erasing data from one 
or more ranges of LBAs while leaving the rest of the 
drive’s contents untouched. Neither hard disks nor SSDs 
include specialized commands to erase specific regions 
of the drive.²

Many software utilities [14, 5, 28, 23] attempt to san- 
itize individual files. All of them use the same approach 
as the software-based full-disk erasure tools: they over- 
write the file multiple times with multiple bit patterns and 
then delete it. Other programs will repeatedly overwrite 
the free space (i.e., space that the file system has not allo- 
cated to a file) on the drive to securely erase any deleted 
files. 

We tested 13 protocols, published as a variety of government
standards, as well as commercial software designed
to erase single files. To reduce the number of
drives needed to test these techniques, we tested multiple
techniques simultaneously on one drive. We formatted
the drive under Windows and filled a series of 1 GB
files with different fingerprints. We then applied one era- 
sure technique to each file, disassembled the drive, and 
searched for the fingerprints. 

Because we applied multiple techniques to the drive at 
once, the techniques may interact: If the first technique 
leaves data behind, a later technique might overwrite it. 
However, the amount of data we recover from each file
is a lower bound on the amount left after the technique
completed. To moderate this effect, we ran the experiment
three times, applying the techniques in different orders.
One protocol, described in 1996 by Gutmann [19], includes
35 passes and had a very large effect on measurements
for protocols run immediately before it, so we
measured its effectiveness on its own drive.


²The ACS-2 draft standard [4] provides a “TRIM” command that
informs the drive that a range of LBAs is no longer in use, but this
does not have any reliable effect on data security.


Overwrite operation               Data recovered
                                  SSDs              USB
Filesystem delete                 4.3 - 91.3%       99.4%
Gutmann [19]                      0.8 - 4.3%        71.7%
Gutmann “Lite” [19]               0.02 - 8.7%       84.9%
US DoD 5220.22-M (7) [11]         0.01 - 4.1%       0.0 - 8.9%
RCMP TSSIT OPS-II [26]            0.01 - 9.0%       0.0 - 23.5%
Schneier 7 Pass [27]              1.7 - 8.0%        0.0 - 16.2%
German VSITR [9]                  5.3 - 5.7%        0.0 - 9.3%
US DoD 5220.22-M (4) [11]         5.6 - 6.5%        0.0 - 11.5%
British HMG IS5 (Enh.) [14]       4.3 - 7.6%        0.0 - 34.7%
US Air Force 5020 [2]             5.8 - 7.3%        0.0 - 63.5%
US Army AR380-19 [6]              6.91 - 7.07%      1.1%
Russian GOST P50739-95 [14]       7.07 - 13.86%     1.1%
British HMG IS5 (Base.) [14]      6.3 - 58.3%       0.6%
Pseudorandom Data [14]            6.16 - 75.7%      1.1%
Mac OS X Sec. Erase Trash [5]     67.0%             9.8%


Table 3: Single-file overwriting. None of the protocols
tested successfully sanitized the SSDs or the USB drive
in all cases. The ranges represent multiple experiments
with the same algorithm (see text).


Drive         Overwrites       Free Space   Recovered
C (SSD)       100x             20 MB        87%
C             100x             19,800 MB    79%
C             100x + defrag.   20 MB        86%
L (USB key)   100x             6 MB         64%
L             100x             500 MB       53%
L             100x + defrag.   6 MB         62%


Table 4: Free space overwriting. Free space overwriting
left most of the data on the drive, even with varying
amounts of free space. Defragmenting the data had only
a small effect on the data left over (1%).


All single-file overwrite sanitization protocols failed 
(Table 3): between 4% and 75% of the files’ contents 
remained on the SATA SSDs. USB drives performed no 
better: between 0.57% and 84.9% of the data remained. 


Next, we tried overwriting the free space on the drive. 
In order to simulate a used drive, we filled the drive 
with small (4 KB) and large files (512 KB+). Then, we 
deleted all the small files and overwrote the free space 
100 times. Table 4 shows that regardless of the amount 
of free space on the drive, overwriting free space was not 
successful. Finally, we tried defragmenting the drive, 
reasoning that rearranging the files in the file system 
might encourage the FTL to reuse more physical storage 
locations. The table shows this was also ineffective. 


3.4 Sanitization standards 


Although many government standards provide guidance 
on storage sanitization, only one [25] (that we are aware 
of) provides guidance specifically for SSDs and that is 
limited to “USB Removable Disks.” Most standards, 
however, provide separate guidance for magnetic media 
and flash memory. 

For magnetic media such as hard disks, the standards 
are consistent: overwrite the drive a number of times, 
execute the built-in secure erase command and destroy 
the drive, or degauss the drive. For flash memory, how- 
ever, the standards do not agree. For example, NIST 800- 
88 [25] suggests overwriting the drive, Air Force Sys- 
tem Security Instruction 5020 suggests “[using] the erase 
procedures provided by the manufacturer” [2], and the 
DSS Clearing & Sanitization matrix [11] suggests “per- 
form[ing] a full chip erase per manufacturer’s datasheet.” 

None of these solutions are satisfactory: Our data 
shows that overwriting is ineffective and that the “erase 
procedures provided by the manufacturer” may not work 
properly in all cases. The final suggestion to perform a 
chip erase seems to apply to chips rather than drives, and 
it is easy to imagine it being interpreted incorrectly or 
applied to SSDs inappropriately. Should the user consult 
the chip manufacturer, the controller manufacturer, or the 
drive manufacturer for guidance on sanitization? 

We conclude that the complexity of SSDs relative to 
hard drives requires that they provide built-in sanitiza- 
tion commands. Since our tests show that manufacturers 
do not always implement these commands correctly, they 
should be verifiable as well. Current and proposed ATA 
and SCSI standards provide no mechanism for verifica- 
tion and the current trend toward encrypting SSDs makes 
verification even harder. 

Built-in commands for whole disk sanitization appear 
to be effective, if implemented correctly. However, no 
drives provide support for sanitizing a single file in iso- 
lation. The next section explores how an FTL might sup- 
port this operation. 


4 Erasing files 


The software-only techniques for sanitizing a single file 
we evaluated in Section 3 failed because FTL complexity 
makes it difficult to reliably access a particular physical 
storage location. Circumventing this problem requires 
changes in the FTL. Previous work in this area [22] used 
encryption to support sanitizing individual files in a file 
system custom built for flash memory. This approach 
makes recovery from file system corruption difficult and 
it does not apply to generic SSDs. 

This section describes FTL support for sanitizing arbitrary 
regions of an SSD’s logical block address space. The 
extensions we describe leverage detailed measurements of 
flash memory characteristics. We briefly describe our 
baseline FTL and the details of flash behavior that our 
technique relies upon. Then, we present and evaluate three 
ways an FTL can support single-file sanitization. 

Chip Name  | Max Cycles | Tech Node | Cap. (Gb) | Page Size (B) | Pages/Block | Blocks/Plane | Planes/Die | Dies | Die Cap (Gb) 
C-TLC16*   | *          | 43 nm     | 16        | 8192          | *           | 8192         | *          | 1    | 16 
B-MLC32-4* | 5,000      | 34 nm     | 128       | 4096          | 256         | 2048         | 2          | 4    | 32 
B-MLC32-1* | 5,000      | 34 nm     | 32        | 4096          | 256         | 2048         | 2          | 1    | 32 
F-MLC16*   | 5,000      | 41 nm     | 16        | 4096          | 128         | 2048         | 2          | 1    | 16 
A-MLC16*   | 10,000     | x         | 16        | 4096          | 128         | 2048         | 2          | 1    | 16 
B-MLC16*   | 10,000     | 50 nm     | 32        | 4096          | 128         | 2048         | 2          | 2    | 16 
C-MLC16*   | x          | x         | 32        | 4096          | *           | *            | *          | 2    | 16 
D-MLC16*   | 10,000     | x         | 32        | 4096          | 128         | 4096         | 1          | 2    | 16 
E-MLC16™*  | TBD        | *         | 64        | 4096          | 128         | 2048         | 2          | 4    | 16 
B-MLC8*    | 10,000     | 72 nm     | 8         | 2048          | 128         | 4096         | 1          | 1    | 8 
E-MLC4*    | 10,000     | x         | 8         | 4096          | 128         | 1024         | 1          | 2    | 4 
E-SLC8"™*  | 100,000    | *         | 16        | 4096          | 64          | 2048         | 2          | 2    | 8 
A-SLC8*    | 100,000    | *         | 8         | 2048          | 64          | 4096         | 2          | 1    | 8 
A-SLC4*    | 100,000    | *         | 4         | 2048          | 64          | 4096         | 1          | 1    | 4 
B-SLC2*    | 100,000    | 50 nm     | 2         | 2048          | 64          | 2048         | 1          | 1    | 2 
B-SLC4*    | 100,000    | 72 nm     | 4         | 2048          | 64          | 2048         | 2          | 1    | 4 
E-SLC4*    | 100,000    | *         | 8         | 2048          | 64          | 4096         | 1          | 2    | 4 
A-SLC2*    | 100,000    | *         | 2         | 2048          | 64          | 1024         | 2          | 1    | 2 

*Chips tested for data scrubbing. 
‘Chips tested for degaussing. 
x No data available 

Table 5: Flash Chip Parameters. Each name encodes the 
manufacturer, cell type and die capacity in Gbits. Parameters 
are drawn from datasheets where available. We studied 18 
chips from 6 manufacturers. 


4.1 The flash translation layer 


We use the FTL described in [7] as a starting point. The 
FTL is page-based, which means that LBAs map to in- 
dividual pages rather than blocks. It uses log-structured 
writes, filling up one block with write data as it arrives, 
before moving on to another. As it writes new data for 
an LBA, the old version of the data becomes invalid but 
remains in the array (i.e., it becomes remnant data). 

When a block is full, the FTL must locate a new, 
erased block to continue writing. It keeps a pool of 
erased blocks for this purpose. If the FTL starts to 
run short of erased blocks, further incoming accesses 
will stall while it performs garbage collection by con- 
solidating valid data and freeing up additional blocks. 
Once its supply of empty blocks is replenished, it re- 
sumes processing requests. During idle periods, it per- 
forms garbage collection in the background, so blocking 
is rarely needed. 

To rebuild the map on startup, the FTL stores a reverse 
map (from physical address to LBA) in a distributed fashion. 
When the FTL writes data to a page, the FTL writes the 
corresponding LBA to the page’s out-of-band section. To 
accelerate the start-up scan, the FTL stores a summary of 
this information for the entire block in the block’s last 
page. This complete reverse map will also enable efficiently 
locating all copies of an LBA’s data in our scan-based scrub 
technique (see Section 4.4). 
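
The behavior described in this subsection can be illustrated with a minimal sketch of a page-mapped, log-structured FTL. This is not the FTL from [7]; the class name `SimpleFTL`, the four-page block size, and the dictionary-based page layout are assumptions made for illustration, and garbage collection and the per-block summary page are left out.

```python
PAGES_PER_BLOCK = 4  # hypothetical; the chips in Table 5 have 64-256 pages per block


class SimpleFTL:
    """Toy page-mapped, log-structured FTL (garbage collection is not modeled)."""

    def __init__(self, num_blocks):
        # Each page is a dict; the "lba" field plays the role of the
        # out-of-band area that holds the reverse-map entry.
        self.blocks = [
            [{"data": None, "lba": None, "valid": False}
             for _ in range(PAGES_PER_BLOCK)]
            for _ in range(num_blocks)
        ]
        self.lba_map = {}   # logical-to-physical map: LBA -> (block, page)
        self.next_free = 0  # index of the next erased page in log order

    def write(self, lba, data):
        block, page = divmod(self.next_free, PAGES_PER_BLOCK)
        if block >= len(self.blocks):
            raise RuntimeError("no erased blocks left (GC is not modeled)")
        if lba in self.lba_map:
            old_block, old_page = self.lba_map[lba]
            # The old copy is only marked invalid; its bits stay in the
            # array as remnant data.
            self.blocks[old_block][old_page]["valid"] = False
        self.blocks[block][page] = {"data": data, "lba": lba, "valid": True}
        self.lba_map[lba] = (block, page)
        self.next_free += 1

    def read(self, lba):
        block, page = self.lba_map[lba]
        return self.blocks[block][page]["data"]

    def remnants(self, lba):
        """Physical locations that still hold stale data for this LBA."""
        return [(b, p)
                for b, blk in enumerate(self.blocks)
                for p, pg in enumerate(blk)
                if pg["lba"] == lba and not pg["valid"]]

    def rebuild_map(self):
        """Rebuild the LBA map by scanning the per-page reverse-map entries."""
        rebuilt = {}
        for b, blk in enumerate(self.blocks):
            for p, pg in enumerate(blk):
                if pg["lba"] is not None:
                    rebuilt[pg["lba"]] = (b, p)  # later log entries win
        return rebuilt
```

Writing the same LBA twice leaves the first copy in `remnants(lba)` even though `read(lba)` returns only the new data, which is the situation the scrubbing techniques below are meant to address. The rebuild relies on the sketch's simplification that log order equals physical order.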


4.2 Scrubbing LBAs 


Sanitizing an individual LBA is difficult because the 
flash page it resides in may be part of a block that con- 
tains useful data. Since flash only supports erasure at the 
block level, it is not possible to erase the LBA’s contents 
in isolation without incurring the high cost of copying the 
entire contents of the block (except the page containing 
the target LBA) and erasing it. 

However, programming individual pages is possible, so an 
alternative would be to re-program the page to turn all the 
remaining 1s into 0s. We call this scrubbing the page. A 
scrubbing FTL could remove remnant data by scrubbing pages 
that contain stale copies of data in the flash array, or it 
could prevent their creation by scrubbing the page that 
contained the previous version whenever it wrote a new one. 
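
A short sketch of why re-programming can destroy data without an erase: it assumes the usual NAND convention that an erased cell reads as 1 and that programming can only move bits from 1 to 0, which is modeled here as a bitwise AND. The function names are illustrative only.

```python
def erase(num_bytes):
    """An erased page reads as all 1s."""
    return bytes([0xFF]) * num_bytes


def program(cells, new_bits):
    """Programming can only clear bits (1 -> 0), modeled as a bitwise AND."""
    return bytes(a & b for a, b in zip(cells, new_bits))


def scrub(cells):
    """Scrubbing re-programs the page so that every remaining 1 becomes 0."""
    return program(cells, bytes(len(cells)))
```

After `scrub`, further programming cannot bring the old contents back, because no program operation can set a bit to 1; only a block erase can.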


[Figure 4: flowchart with the boxes “Erase Block”, “Program 
Random Data”, “For all pages in the Block”, “Scrub Randomly 
Selected Page”, and “Read All Pages for Errors”] 

Figure 4: Testing data scrubbing To determine whether 
flash devices can support scrubbing we programmed 
them with random data, randomly scrubbed pages one 
at a time, and then checked for errors. 

The catch with scrubbing is that manufacturer datasheets 
require programming the pages within a block in sequential 
order to reduce the impact of program disturb effects that 
can increase error rates. Scrubbing would violate this 
requirement. However, previous work [18] shows that the 
impact of reprogramming varies widely between pages and 
between flash devices, and that, in some cases, reprogramming 
(or scrubbing) pages would have no effect. 

To test this hypothesis, we use our flash testing board 
to scrub pages on 16 of the chips in Table 5 and measure 
the impact on error rate. The chips span six manufac- 
turers, five technology nodes and include both MLC and 
SLC chips. 

Figure 4 describes the test we ran. First, we erase the 
block and program random data to each of its pages to 
represent user data. Then, we scrub the pages in ran- 
dom order. After each scrub we read all pages in the 
block to check for errors. Flash blocks are independent, 
so checking for errors only within the block is sufficient. 
We repeated the test across 16 blocks spread across each 
chip. 
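
The test procedure above can be sketched in code. The real experiment ran on a flash testing board; here `IdealBlock` is a hypothetical software stand-in that never suffers program disturb, so it only shows the shape of the loop, not the measured error behavior.

```python
import random


class IdealBlock:
    """Hypothetical block model with no program-disturb errors."""

    def __init__(self, pages, page_size):
        self.pages, self.page_size = pages, page_size
        self.cells = [bytes([0xFF]) * page_size for _ in range(pages)]

    def erase(self):
        self.cells = [bytes([0xFF]) * self.page_size for _ in range(self.pages)]

    def program(self, page, data):
        # Programming can only clear bits (1 -> 0).
        self.cells[page] = bytes(a & b for a, b in zip(self.cells[page], data))

    def read(self, page):
        return self.cells[page]


def scrub_test(block, rng):
    """Return the bit-error count in unscrubbed pages after each scrub."""
    block.erase()
    expected = []
    for page in range(block.pages):
        data = bytes(rng.randrange(256) for _ in range(block.page_size))
        block.program(page, data)  # random data stands in for user data
        expected.append(data)

    order = list(range(block.pages))
    rng.shuffle(order)  # scrub the pages in random order

    errors, scrubbed = [], set()
    for page in order:
        block.program(page, bytes(block.page_size))  # scrub: program all 0s
        scrubbed.add(page)
        bit_errors = sum(
            bin(got ^ want).count("1")
            for q in range(block.pages) if q not in scrubbed
            for got, want in zip(block.read(q), expected[q])
        )
        errors.append(bit_errors)
    return errors
```

With a real device behind the same interface, the first index at which the returned list becomes non-zero would give one estimate of that block's scrub budget.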

The results showed that, for SLC devices, scrubbing 
did not cause any errors at all. This means that the number 
of scrubs that are acceptable (the scrub budget) for 
SLC chips is equal to the number of pages in a block. 

Figure 5: Behavior under data scrubbing Scrubbing 
causes more errors in some chips than others, resulting 
in wide variation of scrub budgets for MLC devices. 

For MLC devices, determining the scrub budget is more 
complicated. First, scrubbing one page invariably caused 
severe corruption in exactly one other page. This occurred 
because each transistor in an MLC array holds two bits that 
belong to different pages, and scrubbing one page reliably 
corrupts the other. Fortunately, it is easy to determine the 
paired page layout in all the chips we have tested, and the 
location of the paired page of a given page is fixed for a 
particular chip model. The paired page effect means that the 
FTL must scrub both pages in a pair at the same time, 
relocating the data in the page that was not the primary 
target of the scrub. 
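
As a rough sketch of the paired-page rule, the function below lists the operations an FTL might issue to scrub one MLC page. The `pair_of` mapping is hypothetical: real pairings are fixed per chip model and would have to be determined for each part.

```python
def scrub_mlc_page(target, pair_of, valid_pages):
    """Return the flash operations needed to scrub `target` on an MLC chip.

    pair_of:     hypothetical mapping from a page to its paired page
    valid_pages: set of pages in the block that hold live data
    """
    partner = pair_of[target]
    ops = []
    if partner in valid_pages:
        # The partner's data would be corrupted, so move it elsewhere first.
        ops.append(("relocate", partner))
    ops.append(("scrub", target))
    ops.append(("scrub", partner))
    return ops
```

If the partner page holds no live data, the relocation step is skipped and both pages are simply scrubbed.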

Figure 5 shows bit error rates for MLC devices as a 
function of scrub count, but excluding errors in paired 
pages. The data show that for three of the nine chips we 
tested, scrubbing caused no errors in the unscrubbed data in 
the block. For five of the remaining devices errors start to 
appear after between 2 and 46 scrubs. The final chip, 
B-MLC32-1, showed errors without any scrubbing. For all 
the chips that showed errors, error rates increase steeply 
with more scrubbing (the vertical axis is a log scale). 

It may be possible to reduce the impact of scrubbing 
(and, therefore, increase the scrub budget) by carefully 
measuring the location of errors caused by scrubbing a 
particular page. Program disturb effects are strongest 
between physically adjacent cells, so the distribution of 
scrubs should affect the errors they cause. As a result, 
whether scrubbing a page is safe would depend on which 
other pages the FTL has scrubbed in the block, not the 
number of scrubs. 

The data in the figure also show that denser flash de- 
vices are less amenable to scrubbing. The chips that 
showed no errors (B-MLC16, D-MLC16, and B-MLC8) 
are 50 nm or 70 nm devices, while the chips with the 
lowest scrub budgets (F-MLC16, B-MLC32-4, and B- 
MLC32-1) are 34 or 41 nm devices. 


4.3 Sanitizing files in the FTL 


The next step is to use scrubbing to add file sanitization 
support to our FTL. We consider three different methods 
that make different trade-offs between performance and 
data security — immediate scrubbing, background scrub- 
bing, and scan-based scrubbing. 
Name              | Total Accesses | Reads | Description 
Patch             | 64 GB          | 83%   | Applies patches to the Linux kernel from version 2.6.0 to 2.6.29 
OLTP              | 34 GB          | 80%   | Real-time processing of SQL transactions 
Berkeley-DB Btree | 34 GB          | 34%   | Transactional updates to a B+tree key/value store 
Financial         | 17 GB          | 15%   | Live OLTP trace for financial transactions 
Build             | 5.5 GB         | 94%   | Compilation of the Linux 2.6 kernel 
Software devel.   | 1.1 GB         | 65%   | 24 hour trace of a software development work station 
Swap              | 800 MB         | 84%   | Virtual memory trace for desktop applications 

Table 6: Benchmark and application traces We use traces from 
eight benchmarks and workloads to evaluate scrubbing. 


These methods will eliminate all remnants in the 
drive’s spare area (i.e., that are not reachable via a 
logical block address). As a result, if a file system does 
not create remnants on a normal hard drive (e.g., if the 
file system overwrites a file’s LBAs when it performs a 
delete), it will not create remnants when running on our 
FTL. 

Immediate scrubbing provides the highest level of 
security: write operations do not complete until the 
scrubbing is finished, that is, until the FTL has scrubbed 
the page that contained the old version of the LBA’s 
contents. In most cases, the performance impact will be 
minimal because the FTL can perform the scrub and the 
program in parallel. 

When the FTL exceeds the scrub budget for a block, 
it must copy the contents of the block’s valid pages to a 
new block and then erase the block before the operation 
can complete. As a result, small scrub budgets (as we 
saw for some MLC devices) can degrade performance. 
We measure this effect below. 
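
The scrub-budget rule can be sketched as follows. The `Block` structure and the operation tuples are illustrative assumptions rather than the simulator's actual data structures; the point is the decision between scrubbing in place and falling back to copy-and-erase.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    valid_pages: set = field(default_factory=set)
    scrubs_since_erase: int = 0


def sanitize_old_page(block, page, scrub_budget):
    """Remove the stale copy at `page`; return the flash operations issued."""
    block.valid_pages.discard(page)  # the old version is no longer live
    if block.scrubs_since_erase < scrub_budget:
        block.scrubs_since_erase += 1
        return [("scrub", page)]
    # Budget exhausted: copy out the live pages, then erase the whole block.
    ops = [("copy_out", p) for p in sorted(block.valid_pages)]
    ops.append(("erase_block", None))
    block.valid_pages.clear()
    block.scrubs_since_erase = 0
    return ops
```

With `scrub_budget=0` every sanitization falls through to the copy-and-erase path, which corresponds to using erase commands alone.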

Background scrubbing provides better performance by 
allowing writes to complete and then performing the 
scrubbing in the background. This results in a brief win- 
dow when remnant data remains on the drive. Back- 
ground scrubbing can still degrade performance because 
the scrub operations will compete with other requests for 
access to the flash. 

Scan-based scrubbing incurs no performance overhead 
on normal write operations but adds a command to sani- 
tize a range of LBAs by overwriting the current contents 
of the LBAs with zero and then scrubbing any storage 
that previously held data for the LBAs. This technique 
exploits the reverse (physical to logical) address map 
that the SSD stores to reconstruct the logical-to-physical 
map. 

To execute a scan-based scrubbing command, the FTL 
reads the summary page from each block and checks if 
any of the pages in the block hold a copy of an LBA that 
the scrub command targets. If it does, the FTL scrubs 
that page. If it exceeds the scrub budget, the FTL will 
need to relocate the block’s contents. 
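
The scan described above might look like the sketch below. It assumes each block's summary page has already been read into a list of LBAs (one entry per page, `None` for unwritten pages) and, as a simplification, that no block has used any of its scrub budget yet.

```python
def scan_based_scrub(summaries, target_lbas, scrub_budget):
    """Plan a scan-based scrub.

    summaries:   {block_id: [lba or None for each page]}, from summary pages
    target_lbas: set of LBAs named by the scrub command
    Returns (pages_to_scrub, blocks_to_relocate).
    """
    pages_to_scrub, blocks_to_relocate = [], []
    for block_id, lbas in summaries.items():
        hits = [page for page, lba in enumerate(lbas) if lba in target_lbas]
        if not hits:
            continue  # nothing in this block belongs to the target range
        if len(hits) <= scrub_budget:
            pages_to_scrub.extend((block_id, page) for page in hits)
        else:
            # Too many scrubs for this block: relocate live data and erase.
            blocks_to_relocate.append(block_id)
    return pages_to_scrub, blocks_to_relocate
```

Because only summary pages are read during the scan, its cost grows with the number of blocks rather than with the number of pages.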
We also considered an SSD command that would apply 
scrubbing to specific write operations that the operating 
system or file system marked as “sanitizing.” However, 
immediate and background scrubbing work by guaranteeing 
that only one valid copy of an LBA exists by always 
scrubbing the old version when writing the new version. 
Applying scrubbing to only a subset of writes would violate 
this invariant and allow the creation of remnants that a 
single scrub could not remove. 


4.4 Results 


To understand the performance impact of our scrubbing 
techniques, we implemented them in a trace-based FTL 
simulator. The simulator implements the baseline FTL 
described above and includes detailed modeling of command 
latencies (based on measurements of the chips in Table 5) 
and garbage collection overheads. For these experiments we 
used E-SLC8 to collect SLC data and F-MLC16 to collect MLC 
data. We simulate a small, 16 GB SSD with 15% spare area to 
ensure that the FTL does frequent garbage collection even on 
the shorter traces. 

Table 6 summarizes the eight traces we used in our 
experiments. They cover a wide range of applications 
from web-based services to software development to 
databases. We ran each trace on our simulator and report 
the latency of each FTL-level page-sized access and trace 
run time. Since the traces include information about 
when the application performed each I/O, the change 
in trace run time corresponds to application-level 
performance changes. 


Immediate and background scrubbing Figure 6 
compares the write latency for immediate and back- 
ground scrubbing on SLC and MLC devices. For MLC, 
we varied the number of scrubs allowed before the FTL 
must copy out the contents of the block. The figure nor- 
malizes the data to the baseline configuration that does 
not perform scrubbing or provide any protection against 
remnant data. 

For SLC-based SSDs, immediate scrubbing causes no 
decrease in performance, because scrubs frequently exe- 
cute in parallel with the normal write access. 
Figure 6: Immediate and background scrubbing performance For 
chips that can withstand at least 64 scrub operations, both 
background and immediate scrubbing can prevent the creation 
of data remnants with minimal performance impact. For SLC 
devices (which can support unlimited scrubbing), background 
scrubbing has almost no effect and immediate scrubbing 
increases write latency by about 2x. 


In MLC devices, the cost of immediate scrubbing can 
be very high if the chip can tolerate only a few scrubs 
before an erase. For 16 scrubs, operation latency increases 
by 6.4x on average and total runtime increases by up to 
11.0x, depending on the application. For 64 scrubs, the 
cost drops to 2.0x and 3.2x, respectively. 

However, even a small scrub budget reduces latency 
significantly compared to relying on erases (and the 
associated copy operations) to prevent remnants. 
Implementing immediate sanitization with just erase 
commands increases operation latency by 130x on average 
(as shown by the “Scrub 0” data in Figure 6). 

If the application allows time for background operations 
(e.g., Build, Swap and Dev), background scrubbing with a 
scrub budget of 16 or 64 has a negligible effect on 
performance. However, when the application issues many 
requests in quick succession (e.g., OLTP and BDB), 
scrubbing in the background strains the garbage collection 
system and write latencies increase by 126x for 16 scrubs 
and 85x for 64 scrubs. In contrast, slowdowns for immediate 
scrubbing range from just 1.9x to 2.0x for a scrub budget 
of 64 and from 4.1x to 7.9x for 16 scrubs. 

Scrubbing also increases the number of erases required 
and, therefore, speeds up program/erase-induced wear out. 
Our results for MLC devices show that scrubbing increased 
wear by 5.1x for 16 scrubs per block and 2.0x with 64 
scrubs per block. Depending on the application, the 
increased wear for chips that can tolerate only a few 
scrubs may or may not be acceptable. Scrubbing SLC devices 
does not require additional erase operations. 


Finally, scrubbing may impact the long-term integrity 
of data stored in the SSD in two ways. First, although 
manufacturers guarantee that data in brand new flash 
devices will remain intact for at least 10 years, as the 
chip ages data retention time drops. As a result, the 
increase in wear that scrubbing causes will reduce data 
retention time over the lifetime of the SSD. Second, even 
when scrubbing does not cause errors immediately, it may 
affect the analog state of other cells, making it more 
likely that they give rise to errors later. Figure 5 
demonstrates the analog nature of the effect: B-MLC32-4 
shows errors that come and go for eight scrubs. 


Overall, both immediate and background scrubbing 
are useful options for SLC-based SSDs and for MLC- 
based drives that can tolerate at least 64 scrubs per block. 
For smaller scrub budgets, both the increase in wear 
and the increase in write latency make these techniques 
costly. Below, we describe another approach to sanitiz- 
ing files that does not incur these costs. 


Figure 7: Scan-based scrubbing latency The time to 
scrub 1 GB varies with the number of scrubs each block 
can withstand, but in all cases the operation takes less 
than 30 seconds. 

Scan-based scrubbing Figure 7 measures the latency 
for a scan-based scrubbing operation in our FTL. We ran 
each trace to completion and then issued a scrub command 
to 1 GB worth of LBAs from the middle of the device. The 
amount of scrubbing that the chips can tolerate affects 
performance here as well: scrubbing can reduce the scan 
time by as much as 47%. However, even for the case where 
we must use only erase commands (MLC-scrub-0), the 
operation takes a maximum of 22 seconds. This latency 
breaks down into two parts: the time required to scan the 
summary pages in each block (0.64 s for our SLC SSD and 
1.3 s for MLC) and the time to perform the scrubbing 
operations and the resulting garbage collection. The 
summary scan time will scale with SSD size, but the 
scrubbing and garbage collection time are primarily a 
function of the size of the target LBA region. As a result, 
scan-based scrubbing even on large drives will be quick 
(e.g., ~62 s for a 512 GB drive). 
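
One way to arrive at an estimate close to the ~62 s figure is the short calculation below. It assumes the MLC numbers are the ones being extrapolated: the 1.3 s summary scan scales linearly from the simulated 16 GB drive, and the remainder of the 22 s worst case (scrubbing plus garbage collection for a 1 GB region) stays fixed. That choice of inputs is an inference, not something the text states.

```python
SIM_DRIVE_GB = 16   # size of the simulated SSD
MLC_SCAN_S = 1.3    # summary-page scan time on the simulated MLC SSD
MAX_TOTAL_S = 22.0  # worst-case total latency (MLC-scrub-0) for a 1 GB scrub


def estimated_scrub_latency(drive_gb):
    """Estimate scan-based scrub latency (seconds) for a 1 GB LBA range."""
    scan = MLC_SCAN_S * drive_gb / SIM_DRIVE_GB  # scales with drive size
    scrub_and_gc = MAX_TOTAL_S - MLC_SCAN_S      # depends on region size only
    return scan + scrub_and_gc


print(round(estimated_scrub_latency(512), 1))  # prints 62.3
```

Under these assumptions the scan accounts for 41.6 s and the scrubbing and garbage collection for 20.7 s.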


5 Conclusion 


Sanitizing storage media to reliably destroy data is an 
essential aspect of overall data security. We have em- 
pirically measured the effectiveness of hard drive-centric 
sanitization techniques on flash-based SSDs. For san- 
itizing entire disks, built-in sanitize commands are ef- 
fective when implemented correctly, and software tech- 
niques work most, but not all, of the time. We found that 
none of the available software techniques for sanitizing 
individual files were effective. To remedy this problem, 
we described and evaluated three simple extensions to an 
existing FTL that make file sanitization fast and effec- 
tive. Overall, we conclude that the increased complexity 
of SSDs relative to hard drives requires that SSDs pro- 
vide verifiable sanitization operations. 
Abstract 


Exploiting spatial locality is critical for a disk scheduler 
to achieve high throughput. Because of the high cost of disk 
head seeks and the non-preemptible nature of request 
service, state-of-the-art disk schedulers consider the 
locality of both pending and future requests. Though 
schedulers adopting this approach, such as the anticipatory 
scheduler, show substantial performance advantages, they 
need to know from which processes requests are issued to 
evaluate locality. This approach is not effective when the 
knowledge about processes is not available (e.g., in virtual 
machine environments, network or parallel file systems, and 
SANs) or the locality exhibited on a disk region is not 
solely determined by individual processes (e.g., in the case 
of cooperative process groups and disk arrays where 
requested data are striped). 

We propose a light-weight disk scheduling framework 
that does not require any process knowledge for analyzing 
request locality. Based solely on requests’ own 
characteristics, the framework can make any work-conserving 
scheduler non-work-conserving, i.e., able to take future 
requests as dispatching candidates, to fully exploit 
locality. Additionally, we show how to effectively extend 
the framework to the disk array environment. Our design, 
Stream Scheduling, is prototyped in the Linux kernel 2.6.31. 
With extensive experiments on representative benchmarks, and 
in various environments such as the Xen virtual machine and 
the PVFS parallel file system, we show that the proposed 
scheduling framework can improve their performance by 
up to 3.2 times. 


1 Introduction 


While the hard disk has maintained exponential growth in 
capacity as a function of time, and sustained improvement 
in peak throughput, its random access performance, which 
is mainly determined by disk seek time, is increasingly a 
bottleneck. This makes the disk scheduler, which aims to 
minimize disk seeks by exploiting spatial locality in the 
requests, increasingly important to disk performance. 


1.1 Non-work-conserving Disk Scheduling 


Traditionally, a disk scheduler such as CSCAN or SPTF 
chooses a request from those that have arrived and are 
pending in its dispatch queue and dispatches it to the disk. 
In a work-conserving mode, the scheduler must choose one 
of the pending requests, if any, to dispatch, even if the 
pending requests are far away from the current disk head 
position. The rationale for non-work-conserving schedulers, 
such as the anticipatory scheduler (AS) [16] and 
Completely Fair Queuing (CFQ) [1], is that a request that 
is soon to arrive might be much closer to the disk head 
than the currently pending requests, in which case it may 
be worthwhile to wait for the future request.¹ If such a 
request does arrive soon and the benefit of avoiding the 
long-distance disk seek outweighs the cost of idle waiting, 
the decision to keep the disk head in place may be justified. 
This is commonly observed when there are multiple processes 
concurrently issuing synchronous requests. For a request 
synchronously issued by a process, the scheduler can 
see its next request only after the request is served. 
Without a short waiting period the spatial locality of 
requests from such a process cannot be exploited. In this 
context, spatial locality refers to the fact that nearby 
disk locations are likely to be accessed by two consecutive 
requests within a short period of time. A process has strong 
locality if soon after its current request is completed, the 
scheduler will receive its next request for a location close 
to the current request. While the traditional scheduler 
selects a request for dispatching only from currently 
pending requests, a non-work-conserving scheduler, in 
essence, selects one from currently pending requests and 
future requests to exploit locality among synchronously 
issued requests. 
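
The trade-off between idle waiting and a long seek can be written as a one-line comparison. This is only an illustration of the cost argument in the paragraph above, not the Stream Scheduling algorithm or the logic of AS or CFQ; the function and its millisecond parameters are hypothetical, and a real scheduler would have to predict the wait rather than know it.

```python
def should_wait(seek_to_pending_ms, wait_ms, seek_to_future_ms):
    """Decide whether to idle for a nearby future request.

    seek_to_pending_ms: seek cost of the best currently pending request
    wait_ms:            predicted idle time until the nearby request arrives
    seek_to_future_ms:  seek cost of that future request once it arrives
    """
    cost_of_future = wait_ms + seek_to_future_ms
    return cost_of_future < seek_to_pending_ms


# A distant pending request (8 ms seek) versus a nearby request expected
# in 1 ms with a 0.5 ms seek: waiting is the cheaper choice.
print(should_wait(8.0, 1.0, 0.5))
```

When the pending request is already close (for example a 2 ms seek) and the predicted wait is 3 ms, the same comparison says to dispatch immediately.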


¹Descriptions of requests’ statuses, such as “currently pending” or 
“future requests”, are relative to the time when a scheduling decision is 
being made. 
1.2 The Issues 


To be effective, a non-work-conserving scheduler needs to 
predict how long it will take for the next nearby request 
to arrive (the strength of the process’s locality) with 
reasonable accuracy, so that a decision can be made whether 
to wait, and if so, for how long. To this end, existing 
non-work-conserving schedulers, such as AS and CFQ, group 
requests according to their issuing processes, analyze 
locality for each group, and make predictions for each 
process. While analyzing and utilizing locality in the 
context of processes is an intuitive and convenient choice, 
there are three scenarios that challenge this practice. 

First, if the requests to a limited disk region are from 
multiple processes, the locality, which is the basis for any 
scheduler to make scheduling decisions, is the result of 
these processes’ combined I/O behaviors. This is espe- 
cially the case when these processes coordinate to issue 
their requests. To determine whether the disk head should 
wait for a future request, the scheduler cares only about 
the probability for a nearby request to appear quickly, re- 
gardless of whether the request is from the same process. 
Limiting locality analysis to each individual process may 
underestimate the locality actually available to the sched- 
uler and lose opportunity for seek reduction. 

Second, in many important system settings process 
information is not available to the disk scheduler. For 
example, in the virtual machine environment only the 
scheduler in the host OS or VMM can actually dispatch I/O 
requests to the disk, on behalf of guest VMs where processes 
run and generate the requests. The scheduler in the host 
usually can only tell from which VM it receives a request 
but cannot distinguish from which process on a VM the 
request is issued. When there are multiple processes running 
on a VM, lack of such knowledge at the host would make 
a non-work-conserving host scheduler less effective. In 
distributed or parallel file systems such as NFS and PVFS, 
the daemon at the file server receives requests from the 
clients and passes them to the disk scheduler without 
telling it which processes at the client side actually 
issued them. For another example, the SAN system and 
hardware RAID have internal disk schedulers that are 
critical to the systems’ efficiency. The system interface 
through which I/O requests are accepted usually does not 
include process information about the request source. 

Third, one of the assumptions made by non-work-conserving 
schedulers is that it is solely the process that 
determines how long it will take for its next request to be 
issued. For this reason, thinktime, the time period between 
two consecutive I/O calls of a process, is treated as an 
attribute of the process and is estimated using the 
process’s history information to predict when its next 
request will arrive. However, if the disk is a member of a 
disk array over which data are striped, the next several 
requests from the process might go to other disks in the 
array and may not be immediately scheduled for those disks. 
Consequently, the timing for this disk to see its next 
request from the process is determined not only by the 
process’s thinktimes, but also by the data striping pattern 
on the array as well as the scheduling decisions made at the 
other disks. By mistaking the time period between two 
consecutive requests from a process for the process’s 
thinktime, a disk’s scheduler finds little opportunity for 
non-work-conserving scheduling. However, by coordinating the 
scheduling of disks in the array, it is possible to reduce 
this time period so that waiting for the next request can 
still be beneficial. 
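
To see why one disk in an array observes long gaps between a process's requests, consider a simple round-robin (RAID-0 style) layout. The chunk size and disk count below are arbitrary illustrative values, and real arrays may use other layouts.

```python
def disk_for_block(logical_block, chunk_blocks, num_disks):
    """Map a logical block to a disk under round-robin striping."""
    return (logical_block // chunk_blocks) % num_disks


# A process reads sequentially, one 4-block chunk per request, on 4 disks.
requests = range(0, 32, 4)
print([disk_for_block(b, chunk_blocks=4, num_disks=4) for b in requests])
# prints [0, 1, 2, 3, 0, 1, 2, 3]
```

Disk 0 sees only every fourth request of the stream, so the gap it observes includes the service and scheduling delays at the other three disks in addition to the process's own thinktime.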


1.3 The Challenges 


To address these issues, we have to give up the assumption 
about the availability of process information. Specifically, 
a scheduler is still expected to take future requests into 
account when making scheduling decisions, even without 
the process information, so that the most suitable request 
among both currently pending requests and future requests 
can be selected for dispatching. There are several critical 
challenges in achieving this objective. 

First, if locality were to be explicitly analyzed for 
predicting the timing and location of the next request, we 
would have to group requests according to some criteria to 
track locality for each group of requests. However, without 
process information, for any artificial grouping method it 
would be hard to accurately predict whether a request would 
appear whose locality is stronger than any of the currently 
pending requests. For example, a seemingly effective method 
is to divide the disk into different regions, either evenly 
or according to request concentrations, and then track 
locality in each region. However, if the region were set too 
small, one process’s synchronous requests could span 
multiple regions, which makes the arrival of the next 
request in a region too late and thus the locality in each 
region too weak. If the region were set too large, requests 
in a large disk area would be included for locality 
tracking, making the measured locality weak because of large 
inter-request distance. In both cases the scheduler may lose 
the opportunity to schedule future requests. In addition, 
region size may have to be dynamically adjusted according to 
changing request distribution on disk, making meaningful 
locality analysis yet more difficult. 

Second, locality is relative. When there are pending 
requests relatively close to the current disk head, the 
scheduler must evaluate only the probability of requests of 
strong locality, and the relatively remote requests become 
less relevant. In contrast, if pending requests are 
relatively remote, even some not-very-close requests need to 
be included for locality analysis so as not to lose 
opportunity for higher disk efficiency. Therefore one must 
determine which requests should be included in an analysis, 
adapting to the locations of pending requests. This would 
significantly add to the complexity and cost of such 
algorithms. 

Third, for data striped on a disk array, even if 
thinktimes can be sufficiently short for I/O-intensive 
applications, the time gaps between two consecutive requests 
seen at each disk can be too large to be exploited by 
non-work-conserving schedulers at individual disks. In this 
case the challenge is whether it is possible to reduce the 
time gaps by coordinating individual disks’ scheduling so 
that it becomes worthwhile for a disk to wait for a future 
request. If the answer is yes, the question is how to know 
when there is such a potential before taking action for the 
coordination. As such an action usually entails postponing 
service of other applications’ requests, it could cause 
excessive overhead and adversely affect performance if it 
did not produce the expected saving in disk seek time. 


1.4 Our Contributions 


In this paper we propose a light-weight framework that 
uses only requests’ characteristics, specifically requests’ 
arrival times and requested data locations, to turn any 
work-conserving scheduler into a non-work-conserving 
one. These request characteristics are readily available in 
any storage system and are employed in almost all disk 
schedulers. In summary, we make the following contribu- 
tions. 

First, instead of using the conventional method of di- 
rect analysis of locality to make a prediction about future 
requests, we propose to track the judicious actions, either 
waiting for future requests or seeking to a pending request, 
that should have been taken for greater disk efficiency. A 
judicious action is the one that helps improve disk effi- 
ciency, and may or may not have actually been taken in 
the prior scheduling. After observing a consistent pattern 
of judicious actions, our scheduling framework guides the 
scheduler to follow the trend in making its next decision. In 
the meantime, the framework retains the mechanism pro- 
vided by the corresponding work-conserving scheduler for 
avoiding long delay or even starvation in its request ser- 
vice. The framework is simple, efficient, effective, and 
minimally intrusive to the work-conserving scheduler. 

Second, we propose an efficient scheme for non-work- 
conserving scheduling for the disk array. To this end, we 
create a virtual disk corresponding to a disk array and apply 
our proposed framework on it to evaluate the potential ben- 
efit of coordinating scheduling across the disks for a par- 
ticular stream of requests. When the evaluation is positive, 
coordinated scheduling of all disks is conducted to make it 
possible for scheduling of future requests to be profitable. 

Third, we have implemented and evaluated the scheduling
framework for single disks and for disk arrays, collectively
named stream scheduling, in the Linux 2.6.31 kernel and the
Linux software RAID (MD). Our experiments on the prototype
system with a variety of benchmarks demonstrate its
significant performance advantages.

Section 2 of this paper details the design of stream 
scheduling. Section 3 presents an extensive experimental 
evaluation. Section 4 describes related work, and Section 
5 concludes. 


2 The design of Stream Scheduling 


While a non-work-conserving scheduler is designed to se- 
lect one request of the lowest cost from currently pend- 
ing requests and future requests, a key technique in the 
scheduling is the effective comparison of costs for serv- 
ing these two types of requests. Because future requests 
are not available for immediate dispatching, the scheduler 
keeps the disk idle for some period of time waiting for them 
if it decides to schedule a future request. Accordingly the 
cost for dispatching a future request is the sum of the wait 
time and the request’s service time, while the cost of dis- 
patching a pending request is just its service time. To effec- 
tively implement a non-work-conserving scheduler, there 
are two critical questions to answer: (1) how likely it is to 
see a future request whose cost is lower than that of the 
pending requests; and, (2) which future requests can be the 
candidates for selection. The answer to the first question 
determines whether a future request should be selected— 
whether the disk should wait—and the answer to the sec- 
ond question determines the threshold of the wait period 
beyond which no requests would be qualified. In the pro- 
posed framework it is the stream scheduling algorithm that 
answers the two questions by taking three inputs, namely 
request arrival time, arriving request location, and pending 
request location. 

When a scheduler is ready to dispatch a new request,
the stream scheduling algorithm decides whether or not to
schedule a future request. If so, it leaves the disk
waiting for an incoming request of relatively strong
locality. Otherwise, it dispatches a pending request
selected by the work-conserving scheduling algorithm. As
the stream scheduling algorithm makes its decisions
independently of the work-conserving scheduling algorithm,
the scheduling framework is applicable to any
work-conserving scheduling algorithm.


2.1 The Stream Scheduling Algorithm 


We consider a decision to make the disk wait for future
requests a judicious one if there exists a future request
R such that wait_time(R) + service_time(R) <
service_time(selected_pending_request), where
wait_time(R) is the time period from the time when the
decision is made to the time when request R arrives,
service_time(R) is the time spent to serve request R, the
first future request dispatched after the decision is made,
and service_time(selected_pending_request) is the service
time of the request selected by the work-conserving
scheduling algorithm when the decision is made. If the
inequality does not hold, the decision that demands
immediate dispatching of a pending request is a judicious
one. Note that the evaluation of the inequality cannot be
completed until a future request satisfying the inequality
actually arrives or until wait_time(R) >
service_time(selected_pending_request) becomes true. To
evaluate the inequality, the service time of a known
request can be estimated according to the distance between
the location of its requested data and the current disk
head position, which can be considered to be the location
of the most recently served request [14, 16]. Therefore, no
matter whether selected_pending_request is actually
dispatched, service_time(selected_pending_request)
can be estimated.
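
The three possible outcomes of evaluating the inequality can be written down directly. The following Python sketch is illustrative only: the function name, the string labels, and the use of milliseconds are assumptions made here, not part of the scheduler described in this paper.

```python
def judge_decision(elapsed, pending_service_time, future_service_time=None):
    """Classify which action was judicious, `elapsed` ms after the decision.

    pending_service_time: service_time(selected_pending_request).
    future_service_time:  estimated service time of a request that has just
                          arrived, or None if no request has arrived yet.
    """
    if (future_service_time is not None
            and elapsed + future_service_time < pending_service_time):
        return "wait"        # a future request satisfied the inequality
    if elapsed > pending_service_time:
        return "dispatch"    # wait_time alone already exceeds the pending cost
    return "undecided"       # the evaluation cannot be completed yet


print(judge_decision(1.0, 8.0, 2.0))   # wait
print(judge_decision(9.0, 8.0))        # dispatch
print(judge_decision(3.0, 8.0))        # undecided
```

Note that the classification depends only on arrival times and positions, not on what the scheduler actually did at the decision point.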

In the inequality only
service_time(selected_pending_request) is known when the
decision is being made, while wait_time(R) and
service_time(R) are unknown. Generally there
are two methods to predict whether the inequality will 
hold. One is the method adopted by existing non-work- 
conserving schedulers, which use wait times and service 
times of previous requests that belong to the same process 
to predict these two times for the next request from the 
process, respectively. This method does not work when 
the process information is unavailable, because we do not 
know which previous requests and which future requests 
should be included in the evaluation of the inequality. To 
address the issue we propose the second method, which 
identifies a series of recently served requests for which the 
inequality held to form a so-called stream. A stream of 
sufficient size indicates that it is likely that the inequality 
would continue to hold and a judicious decision is to wait 
for future requests. 

Figure 1: Illustration of forming a stream and using the stream
for scheduling. In the figure, the mushroom-shaped area ahead of
each completed request describes the inequality on the eligibility
of being a child request. The size of an area is determined by how
close its corresponding pending requests are to the completed
request. When a new request arrives in such an area, it becomes
the child of the completed request associated with the area and
extends the corresponding stream. As shown in the graph, the
arrival of request 2 in the area following request 1 extends the
stream to [1, 2]. When request 2 is completed, its area is
created and the arrival of request 3 in the area further extends
the stream to [1, 2, 3]. A stream cannot be established without
new requests arriving in the defined areas, as shown in the upper
part of the figure. In the lower part of the figure, before the
stream is established, the disk head must leave and then seek back
to serve its next request. When request 4 becomes a child request
and joins the stream, the stream is established (assuming that
stream_threshold is 4). After this, the disk keeps serving requests
in the stream (such as requests 5 and 6) for some period of time
for high I/O efficiency.

Figure 1 illustrates how a stream is formed and how it
is used for request scheduling. The figure shows the
arrival and completion times of requests as well as the
requests' positions on the disk in terms of their requested
data's LBNs (Logical Block Numbers). When the scheduler is
notified that a request has completed, it selects one
request from among the currently pending requests and the
eligible future requests, that is, those satisfying the
inequality. As we can see, the positions of pending
requests determine the eligibility of future requests. This
is what we expect. If there are nearby pending requests,
the criteria for scheduling a future request must be
stricter to make it profitable. Otherwise, it may be
affordable for the disk to wait for a longer time and/or
for a request at a longer distance from the recently
completed request. We may not come to a conclusion on
whether a future request should be selected, or whether the
scheduling decision is judicious, until
service_time(selected_pending_request) after the decision
is made. Note that the conclusion does not depend on what
the actual decision is. If later on we do find a request
arriving at a time and a position that satisfy the
inequality, this request is called the child of the
recently completed request. Therefore, for a request that
is highly likely to have a child, the scheduler should wait
for the child request, instead of immediately dispatching a
pending request. To predict whether a recently completed
request would have a child, we introduce the concept of
stream, which is a sequence of requests
[R_0, R_1, ..., R_{n-1}] that have arrived in
time-ascending order. For any two adjacent requests
(R_{k-1}, R_k) in the stream, R_k is the child of R_{k-1}.
If the length of the stream is equal to or greater than a
predefined threshold stream_threshold, the stream is
considered established.

The assumption we make in the stream scheduling algorithm
is that for an established stream [R_0, R_1, ..., R_{n-1}]
(n >= stream_threshold), request R_{n-1} is highly likely
to have its child request R_n extend the stream. The child
request is the first one that arrives after the completion
of R_{n-1} and satisfies the inequality, and the disk
should wait for the child request. This assumption is
consistent with those made by other non-work-conserving
algorithms to estimate the thinktime and seek time of a
process's next request. In addition, as we do not
independently predict these two times, we can take the
relationship between pending requests and future requests
into account in the assumption. A disk waiting for a child
request will stay idle for at most
service_time(selected_pending_request) if there exist
pending requests. The time at which
service_time(selected_pending_request) has elapsed since a
request's completion is called the request's deadline.
After its deadline, it is not possible to find an eligible
request to be the request's child. If the most recent
request in a stream fails to find its child request, the
stream aborts. Pseudo code for the algorithm is shown in
Figure 2.


As shown in the pseudo code, when a request is com- 
pleted it is possible for it to become a parent of a future re- 
quest. So we insert the request into the parent-to-be queue 
to see if it would have a child that turns it into a parent. 
The queue is sorted by requests’ deadlines, and only re- 
quests whose deadlines are not yet passed remain in the 
queue. Therefore, the size of the queue is usually very 
small. If the recently completed request is at the head of 
an established stream, we let the disk wait for a future re- 
quest and in the meantime activate a timer for the com- 
pleted request. Note that the algorithm does not remember 
every member of a stream. Instead, it only needs to keep 
track of the most recent request of a stream as well as its 
current length. When a new request arrives, we examine 
requests in the parent-to-be queue to see if it can extend a 
stream. If a request in the queue reaches its deadline
without seeing a new request as its child, the stream led
by the request is usually abandoned. One exception is when
the stream has become sufficiently long, that is, when its
size exceeds stream_threshold by a factor of
tolerance_factor (50% by default); in that case we give the
stream a second chance to get extended. When the disk has
kept serving a stream for more than a threshold time period
(stream_time_slice), the disk will dispatch a selected
pending request, instead of waiting for a future child
request in the stream (not shown in the pseudo code). In
our work, we leave the issue of fairness to the external
scheduler that has process information, or to the local
work-conserving scheduler, such as the Deadline scheduler.
When Deadline boosts the priority for dispatching requests
that have waited for too long, the stream algorithm
respects the decision by immediately
sending them to the disk.

/* Procedure invoked upon completion of request R */
R.completion_time = current_time;
R.position = LBN of data requested by R;

/* 'selected_pending_request' is the request selected
   by the work-conserving algorithm */
R.service_time = calculate_service_time(R.position,
                     selected_pending_request.position);
R.deadline = R.completion_time + R.service_time;

/* insert R into the queue sorted by requests'
   deadlines */
queue_of_parent_to_be <-- R;

/* If the stream is established, wait for a
   potential child request */
if (R.stream_size >= stream_threshold) {
    R.timer.timeout = R.service_time;
    activate R.timer;
} else
    dispatch selected_pending_request;


/* Procedure invoked upon arrival of request new_R */
new_R.arrival_time = new_R's arrival time;
new_R.position = LBN of data requested by new_R;

for each request R in 'queue_of_parent_to_be' {
    if (R.deadline < current_time)
        remove R out of the queue;
    if (new_R.arrival_time - R.completion_time +
        calculate_service_time(R.position, new_R.position)
        < R.service_time) {
        /* new_R is R's child */
        if (R.stream_size >= stream_threshold) {
            turn off R's timer;
            dispatch new_R;
        }
        new_R.stream_size = R.stream_size + 1;
        remove R from queue_of_parent_to_be;
    }
    else
        new_R.stream_size = 1;
}


/* Procedure invoked upon expiration of
   request R's timer */
if (R.stream_size >=
    (1+tolerance_factor) * stream_threshold) {
    R.timer.timeout = R.service_time * tolerance_factor;
    R.service_time *= (1+tolerance_factor);
    R.deadline = R.completion_time + R.service_time;
    R.stream_size = stream_threshold;
    activate R.timer;
}
else
    remove R out of 'queue_of_parent_to_be';


Figure 2: Stream scheduling Algorithm. In the pseudo code, 
function calculate_service_time(disk_pos, req_pos) is used to cal- 
culate the service time when the disk head is at disk_pos and the 
requested data is at req_pos, all in terms of LBNs. While we re- 
member only the most recent member request of a stream and the 
size of a stream, we treat the size as an attribute of the request, 
denoted as R.stream_size.

The forming of streams and the scheduling of requests are
two independent procedures. That is, no matter what the
scheduling decision is, the stream's development is not
affected. The forming of streams is determined by the
arrival and location of future requests, which usually do
not depend on whether the disk actually waits for a child
request, though the time period between a request's arrival
and its completion is determined by the scheduling
decision. Therefore, the stream scheduling algorithm can be
used with any work-conserving scheduler. In addition, as
the size of the parent-to-be queue is small, the algorithm
is of low cost, specifically O(N), where N is the size of
the queue.
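
To make the bookkeeping in Figure 2 concrete, the following is a minimal runnable Python sketch of the three procedures. It is not the kernel implementation. The class and function names are hypothetical, `estimate_service_time` is a made-up linear cost model standing in for calculate_service_time(), actual dispatching and timers are left to the caller, and a new request is attached to the first live parent it qualifies for.

```python
from dataclasses import dataclass

STREAM_THRESHOLD = 4     # value used in the paper's evaluation
TOLERANCE_FACTOR = 0.5   # the 50% default mentioned in the text


def estimate_service_time(disk_pos, req_pos):
    """Hypothetical cost model (assumption): 1 ms per 1000 LBNs of distance."""
    return abs(req_pos - disk_pos) / 1000.0


@dataclass(eq=False)
class Request:
    position: int                 # LBN of the requested data
    stream_size: int = 1
    completion_time: float = 0.0
    service_time: float = 0.0     # cost of the pending request chosen at completion
    deadline: float = 0.0


class StreamTracker:
    def __init__(self):
        self.parents_to_be = []   # completed requests that may still get a child

    def on_completion(self, req, now, pending_pos):
        """Queue req as a potential parent; return True if the disk should wait."""
        req.completion_time = now
        req.service_time = estimate_service_time(req.position, pending_pos)
        req.deadline = now + req.service_time
        self.parents_to_be.append(req)
        self.parents_to_be.sort(key=lambda r: r.deadline)
        return req.stream_size >= STREAM_THRESHOLD

    def on_arrival(self, new_req, now):
        """Make new_req the child of the first live parent whose inequality holds."""
        self.parents_to_be = [r for r in self.parents_to_be if r.deadline >= now]
        for parent in self.parents_to_be:
            cost = (now - parent.completion_time) + estimate_service_time(
                parent.position, new_req.position)
            if cost < parent.service_time:
                new_req.stream_size = parent.stream_size + 1
                self.parents_to_be.remove(parent)
                return parent
        new_req.stream_size = 1
        return None

    def on_timer_expired(self, req):
        """Give a long stream a second chance; return False if it is dropped."""
        if req.stream_size >= (1 + TOLERANCE_FACTOR) * STREAM_THRESHOLD:
            req.service_time *= 1 + TOLERANCE_FACTOR
            req.deadline = req.completion_time + req.service_time
            req.stream_size = STREAM_THRESHOLD
            return True
        if req in self.parents_to_be:
            self.parents_to_be.remove(req)
        return False


# A sequential stream (8 LBNs apart) competing with a distant pending request.
tracker = StreamTracker()
now, req, decisions = 0.0, Request(position=0), []
for i in range(1, 6):
    decisions.append(tracker.on_completion(req, now, pending_pos=100_000))
    now += 0.5                          # the next request arrives 0.5 ms later
    nxt = Request(position=i * 8)
    tracker.on_arrival(nxt, now)
    now += 0.1                          # assumed time to serve the nearby request
    req = nxt
print(decisions)   # [False, False, False, True, True]
```

The tracker recommends waiting only from the fourth completion onward, once the stream has reached stream_threshold. Only the most recent member of each stream and its length are stored, as in the pseudo code.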


2.2 The Stream Scheduling Algorithm in a 
Disk Array 


The effectiveness of non-work-conserving scheduling al- 
gorithms depends on the existence of locality in the re- 
quests of a process or a stream. This locality can be suf- 
ficiently strong to form an established stream when it is 
presented to the entire storage system. However, when the 
storage system consists of an array of disks where data 
are striped, each disk only sees a subset of the requests 
and the locality presented to individual disks can be much 
weaker. As each disk has to be individually scheduled to 
accommodate its specific data layout and request pattern, 
instead of all disks being fully synchronized and using one 
request scheduler [19, 8], it would be hard for each
scheduler, on its own, to take advantage of the potential
benefit of non-work-conserving scheduling. As an example,
for a sequence of synchronous requests
[R_0, R_1, ..., R_{n-1}], which could be a stream if they
were all served by a single disk, let us assume that only
requests R_i (i mod m = k) reach disk k, where m is the
number of disks in the array (k = 0, 1, ..., m-1). After
serving R_0, disk 0 would not see R_m until R_1, R_2, ...,
and R_{m-1} have been served by other disks, whose service
times depend on their respective scheduling decisions and
could be significant if long-distance seeks are involved.
Even worse, when one request has to access data spread over
multiple disks, it is not completed until the last piece of
the data is served, and the request's service time can be
long if the disks are not coordinated to serve it quickly.
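
The striping assumption in this example (request R_i goes to disk i mod m) can be illustrated in a few lines of Python. The function name is hypothetical, and m = 5 is chosen only because the evaluation later uses a five-disk array.

```python
def disk_for_request(i, m):
    """Disk that serves request R_i under the example's round-robin striping."""
    return i % m


m = 5
stream = range(12)   # indices of R_0 ... R_11
per_disk = {k: [i for i in stream if disk_for_request(i, m) == k]
            for k in range(m)}

print(per_disk[0])   # [0, 5, 10]
print(per_disk[1])   # [1, 6, 11]
```

Disk 0 therefore sees R_0 and then nothing from the stream until R_5 (that is, R_m), after four other requests have been served elsewhere.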

The time period between completion of a request and
arrival of the next request of a stream observed at one
particular disk (such as completion of R_0 and arrival of
R_m at disk 0 in the example) consists of two types of time
components. One is thinktime, or the time period from the
completion of one request to the arrival of the next one of
the stream observed by the disk array (such as completion
of R_0 and arrival of R_1 in the example stream); the other
is response time, or the time period from the arrival to
the completion of a request in the stream. A request's
response time consists of its wait time and service time.
To enable non-work-conserving scheduling, we need to
minimize the time period for a disk to see its potential
child request. While the involved thinktimes cannot be
reduced for synchronous requests, the response time can be
reduced by dedicating all disks to serving requests of a
stream during a certain time period through disk
coordination.
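
The decomposition above can be checked with simple arithmetic. For synchronous requests, the gap disk 0 observes between completing R_0 and seeing R_m is the sum of m thinktimes and the m-1 response times of R_1 through R_{m-1}. The sketch below uses made-up values in microseconds, and the function name is hypothetical.

```python
def gap_seen_by_disk(thinktimes, response_times):
    """Gap between completing R_0 and seeing R_m at one disk.

    thinktimes:     m values, one after each of R_0 ... R_{m-1}
    response_times: m-1 values, for R_1 ... R_{m-1} served by other disks
    """
    if len(thinktimes) != len(response_times) + 1:
        raise ValueError("expected one more thinktime than response times")
    return sum(thinktimes) + sum(response_times)


m = 5
thinktimes = [100] * m                       # 100 us each (made-up value)
uncoordinated = [8000] * (m - 1)             # long seeks on the other disks
coordinated = [500] * (m - 1)                # disks dedicated to the stream

print(gap_seen_by_disk(thinktimes, uncoordinated))   # 32500
print(gap_seen_by_disk(thinktimes, coordinated))     # 2500
```

Only the response-time term shrinks under coordination; the thinktime term is fixed by the application.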

As we do not have process information, we set up a disk- 
array scheduler that treats the disk array as one big virtual 
disk and uses the method described in the stream schedul- 
ing algorithm to identify streams. The disk-array sched- 
uler uses the array’s logical addresses for calculating ser- 
vice times and uses pending requests on respective phys- 
ical disks to evaluate the inequality for identifying child 
requests. The stream threshold for established streams is
multiplied by m, where m is the number of disks. Once a
stream is established in the virtual disk, which we call a
virtual stream, we attempt to find a stream on each
physical disk corresponding to the virtual stream, which
we call a physical stream. Without dedicating all disks to
the virtual stream, there is little chance for a physical disk 
to see its corresponding physical stream because of high 
response times. However, forcing all disks to serve only 
the virtual stream’s requests before knowing whether the 
physical streams can be formed runs the risk of idling mul- 
tiple disks for an excessively long time. 

To address the challenge, we do not use a request's actual
arrival time to determine whether it can extend a physical
stream at a physical disk, as this time might be
significantly reduced if all disks were dedicated to the
corresponding virtual stream. Instead, we use the arrival
time less the response times of the requests between the
completed request and the disk's next request in the
virtual stream (such as the arrival time of R_m minus the
sum of the response times of requests R_k (1 <= k <= m-1)
in the example stream). The physical streams formed in this
way represent the most optimistic estimates of future
requests' arrival times, because the response times cannot
be reduced to zero even if all disks are dedicated to the
virtual stream. Once the
array scheduler finds that physical streams have been es- 
tablished on all the disks for a particular virtual stream, it 
marks the virtual stream’s next request to each disk as ur- 
gent so that it can be dispatched immediately to bring each 
disk head to the corresponding physical stream. After this, 
the array’s scheduler instructs each disk’s scheduler to use 
their respective physical stream for non-work-conserving 
scheduling and use the actual request arrival time to extend 
the stream. In this way, the non-work-conserving schedul- 
ing is certain to be cost-effective even though the physical 
streams are initiated with optimistic estimates of request 
arrival times. When a disk’s physical stream is broken be- 
cause it fails to find its next child request, this phenomenon 
usually cascades to other disks as it would cause other 
disks’ streams to take a longer time to see their respective
next requests. When the array's scheduler observes broken
physical streams, it will mark the virtual stream as unus- 
able. Note the scheduler will keep maintaining the virtual 
stream to prevent a new stream from being formed and trig- 
gering non-work-conserving scheduling on the disks once 
again, which has been shown not to be cost effective. For 
the disk array, instead of letting each disk decide how long 
it continuously serves a physical stream, we let the array 
scheduler determine the time period during which each 
disk is supposed to serve its physical stream corresponding 
to the virtual stream. In this way the serving of requests in 
a virtual stream is fully coordinated across the disks. 
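
The optimistic arrival-time adjustment used to detect physical streams can be sketched as follows. The function names and all numeric values (in microseconds) are assumptions for illustration; the child test is the same inequality used for a single disk.

```python
def optimistic_arrival(actual_arrival, intervening_response_times):
    """Arrival time of R_m at a disk if R_1 ... R_{m-1} had zero response time."""
    return actual_arrival - sum(intervening_response_times)


def is_child(completion_time, pending_service_time, arrival_time,
             service_from_parent):
    """Single-disk inequality: wait time + service time < pending service time."""
    return (arrival_time - completion_time) + service_from_parent \
        < pending_service_time


# Disk 0 completes R_0 at t = 0; R_5 actually arrives at t = 32500 us because
# R_1 ... R_4 each took 8000 us on uncoordinated disks (made-up values).
actual = 32500
adjusted = optimistic_arrival(actual, [8000] * 4)

print(adjusted)                              # 500
print(is_child(0, 6000, actual, 100))        # False: no physical stream seen
print(is_child(0, 6000, adjusted, 100))      # True: stream visible once adjusted
```

Because the adjustment is optimistic, it is used only to decide whether coordination is worth trying; once coordination starts, actual arrival times are used again.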


3 Performance Evaluation 


To evaluate the performance of the stream schedul- 
ing framework, we implemented it in the Linux kernel 
2.6.31.3, either as a wrapper of a work-conserving disk 
scheduler to create a stream scheduler for individual disks, 
or as a revised implementation of the Linux software RAID 
mdadm for a disk array. In the experiments the CPU is an 
Intel Core2 Duo with 2GB DRAM memory and the disks 
are 7200RPM, 500GB Western Digital Caviar Blue SATA
II (WD5000AAKS) with a 16MB built-in cache. The disk
array has five disks connected to the host via a RAID card 
(RocketRAID 2320). 


3.1 Disk Schedulers in Linux 


Currently there are four configurable disk scheduler mod- 
ules in the Linux distributions, each implementing a com- 
monly used scheduler: Noop, Deadline, AS (or Antic- 
ipatory), and CFQ. Among them, Noop and Deadline 
are work-conserving while the other two are non-work- 
conserving. Noop simply dispatches a request as soon as 
it is received and does nothing beyond merging contiguous 
requests. Though this may not sound useful when the
scheduler dispatches requests directly to a hard disk, it
is actually the preferred choice in other cases, such as in
guest VMs and in systems using a SAN block device. This not
only saves CPU cycles but also allows the requests to reach
the lower level as early as possible, where a scheduler can
see requests from different guest VMs or hosts and knows
how data are actually laid out on the disk(s) [32]. For
this reason,
we include Noop in the evaluation. Deadline is a sched- 
uler approximating CSCAN augmented with a deadline- 
enforcement mechanism to prevent starvation. AS is a 
deadline scheduler enhanced with the anticipatory capabil- 
ity to wait for a future request that is of strong locality and 
is issued by the same process. CFQ aims to fairly distribute 
disk time among I/O-intensive processes and to bound request
response time as Deadline does. As CFQ allows the disk to be
idle waiting for future requests, it is non-work-conserving.


3.2 The Stream Scheduling in Linux 


In the implementation we place Deadline in the stream 
scheduling framework and turn it into a non-work- 
conserving scheduler, the stream scheduler (SS). To 
accommodate the starvation avoidance mechanism, the 
stream scheduling algorithm respects the decision made 
by Deadline about immediate dispatching of expired re- 
quests by suspending its dedicated service to a stream. In 
the evaluation we set stream_threshold to be 4. We set 
stream_time_slice to 124ms if not stated otherwise, that is, 
a stream can be uninterruptedly served for at most 124ms 
if there are other pending requests in the system. This set- 
ting is consistent with that in AS for continuous requests 
from one process. We will present results of a sensitivity 
study on the parameter in Section 3.6. 

Today’s hard disks can hold multiple pending requests
internally and use their own scheduler, such as NCQ, for
internal scheduling. A disk will continue serving the
requests pending in it after it completes a request. This
poses a challenge
to the implementation of the stream scheduling framework 
because the location of the most recently completed re- 
quest is not necessarily the disk head position when the re- 
quest it will dispatch next gets served. For example, when 
SS decides to idle the disk to wait for a future request by 
suspending dispatching requests, it assumes that the disk 
head will stay where it is. However, in a hard disk with 
stored pending requests, the disk head may have sought to 
another pending request scheduled by NCQ. To address the 
issue, we make a customization of the SS algorithm. In the 
kernel, there is a FIFO queue (struct request_queue), into 
which the disk scheduler dispatches its requests and from 
which the disk driver takes requests to the disk hardware. 
In other words, the actual service order will be basically 
consistent with the order in which the requests stay in the
queue, assuming NCQ does not make a major change in the 
order. Accordingly, the disk head position when the next 
request is dispatched can be best indicated by the request 
at the queue tail, or the most recently inserted request. For 
this reason, SS makes a scheduling decision for the tail re- 
quest when it is added into the queue, or considers it as 
the completed request in the stream scheduling algorithm, 
instead of for the actually completed request. If the deci- 
sion is to wait for a future request, none of the currently 
pending requests are allowed to get into the queue and the 
corresponding timer will be activated at this time. In this 
way, the assumption made by SS about the disk head loca- 
tion still holds. 
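
The tail-of-queue rule can be illustrated with a small Python sketch. The deque stands in for the kernel's dispatch queue, and the function name and LBN values are hypothetical.

```python
from collections import deque


def assumed_head_position(dispatch_queue, last_completed_lbn):
    """LBN at which the head is assumed to be when the next request is served."""
    if dispatch_queue:
        return dispatch_queue[-1]     # tail = most recently inserted request
    return last_completed_lbn         # nothing queued: use the last completion


queue = deque([4096, 4104, 4112])     # LBNs already handed to the driver (FIFO)
print(assumed_head_position(queue, 1024))     # 4112
print(assumed_head_position(deque(), 1024))   # 1024
```

As in the text, this estimate is only as good as the assumption that NCQ does not substantially reorder the queued requests.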

To estimate the service time of a request when the disk
head is at disk_pos and the request is at req_pos, both in
terms of LBNs (calculate_service_time(disk_pos, req_pos)),
we adopted a simple empirical method that has been widely
used for its effectiveness [25, 14, 16]. In this method,
pairs of requests separated by various distances are sent
to the disk and the corresponding service times are
collected. A smooth curve is fit through the measured
[distance, time] data points and is used to represent the
calculate_service_time() function. In addition, as CSCAN
prefers to serve requests in the forward direction, for the
same inter-request distance we increase the cost of
backward access by 50%.
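
A minimal sketch of such an estimator is shown below. It makes two loudly stated assumptions: the sample points are invented for illustration (they are not measurements from this paper), and piecewise-linear interpolation is used as a simple stand-in for the fitted smooth curve. The 50% backward penalty follows the text.

```python
from bisect import bisect_left

# Hypothetical (distance in LBNs, service time in ms) samples, sorted by distance.
SAMPLES = [(0, 0.1), (1_000, 0.5), (100_000, 2.0),
           (10_000_000, 8.0), (1_000_000_000, 16.0)]
BACKWARD_PENALTY = 1.5   # backward accesses cost 50% more


def calculate_service_time(disk_pos, req_pos):
    """Estimate service time (ms) for a request at req_pos with the head at disk_pos."""
    distance = abs(req_pos - disk_pos)
    xs = [d for d, _ in SAMPLES]
    i = bisect_left(xs, distance)
    if i == 0:
        t = SAMPLES[0][1]
    elif i == len(SAMPLES):
        t = SAMPLES[-1][1]            # clamp beyond the largest measured distance
    else:
        (x0, y0), (x1, y1) = SAMPLES[i - 1], SAMPLES[i]
        t = y0 + (y1 - y0) * (distance - x0) / (x1 - x0)
    if req_pos < disk_pos:
        t *= BACKWARD_PENALTY
    return t


forward = calculate_service_time(0, 50_000)
backward = calculate_service_time(50_000, 0)
print(forward, backward)   # the backward estimate is 1.5 times the forward one
```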


3.3 Storage without Process Information


Figure 3: Performance of benchmarks par-read, grep, PostMark,
and TPC-H with different disk schedulers (SS, AS, CFQ,
Deadline, and Noop) when the process information of requests 
is removed from the workloads. The performance is presented 
as the schedulers’ percentage improvement over that of Noop. 
For par-read and PostMark the performance is measured with 
throughputs, which are 16.0MB/s and 815.9KB/s, respectively 
for Noop. For grep and TPC-H the performance is measured 
with execution times, which are 73.5s and 228.2s, respectively, 
for Noop. 


We first evaluate schedulers of storage systems for 
which process information for requests is not available, 
such as hardware RAID, SAN, and iSCSI connected stor- 
age devices. As the devices usually use proprietary soft- 
ware and their internal disk schedulers are not open- 
sourced for instrumentation, we hide process context in- 
formation from the schedulers, or equivalently we make 
the schedulers believe that all requests are issued by the 
same process. In this section, we discuss the experimen- 
tal results for one disk, and leave those for disk arrays to 
Section 3.5. 

The benchmarks we use in this experiment are par-read, 
grep, PostMark, and TPC-H. par-read is a microbench- 
mark we wrote to study the impact of varying thinktime 
on the schedulers’ performance. It creates four independent
processes, each reading a 1GB file using 4KB requests in
parallel. There is a 50GB gap between each two adjacent
files. By default the thinktime between consecutive requests
of a process is set to 0. grep is a Linux text
search program we run to look for a non-existent word in 
the Linux 2.6.31 source code tree so that the entire direc- 
tory tree is read. In the experiment we run two greps, each 
reading one of two copies of the Linux directory with a 
50GB gap between them. PostMark is designed to measure the
performance of an Internet server running e-mail, netnews,
or e-commerce applications, where random access of small
files is the dominant access pattern [26]. In the
experiment, we run four PostMark benchmarks (version 1.5.1),
each creating a data set consisting of 10,000 files whose 
sizes are in the range between 0.5KB and 10KB. Each data 
set is 50GB away from the next data set. TPC-H is a deci- 
sion support benchmark that processes business-oriented 
queries against a database system to examine large vol- 
umes of data. In our experiment we use PostgreSQL 8.3.7 
as the database server and use DBT3 1.5.0 to create ta- 
bles in it. We choose the scale factor 1 to generate the 
database and run query 19 against it. We run three TPC-H 
instances, with a 50GB space gap between adjacent data
sets. Figure 3 shows the performance improvements of the 
four schedulers (SS, AS, CFQ, and Deadline) over Noop 
for the four benchmarks. 
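
For readers reproducing the comparison, the percentage improvement over Noop can be computed as below. The Noop baselines (16.0MB/s for par-read, 73.5s for grep) come from the Figure 3 caption; the other inputs are hypothetical values chosen for easy arithmetic. The treatment of execution time as a speedup is an assumption, since the exact formula is not stated in the text.

```python
def improvement_from_throughput(value, noop):
    """Percentage improvement when higher is better (e.g. MB/s)."""
    return (value - noop) / noop * 100.0


def improvement_from_exec_time(value, noop):
    """Percentage improvement when lower is better (assumed: speedup minus one)."""
    return (noop / value - 1.0) * 100.0


# Hypothetical scheduler results against the Noop baselines from Figure 3.
print(improvement_from_throughput(32.0, 16.0))   # 100.0
print(improvement_from_exec_time(36.75, 73.5))   # 100.0
```

Under this convention, doubling throughput and halving execution time both count as a 100% improvement.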

The experiments demonstrate that without process infor- 
mation both AS and CFQ lose the performance advantages 
they had enjoyed when they knew which requests are is- 
sued by the same process. Each process in the benchmarks 
synchronously issues its requests. For benchmarks grep 
and PostMark, which issue random requests and generally 
do not trigger prefetching in the operating system, the disk 
scheduler can see at most one request from a process at a
time. Without seeing a nearby pending request, Deadline
would dispatch a remote one and constantly move the disk
head between remote data sets. This causes its performance
to be as low as that of Noop. Without knowing which process
actually issues a request, AS and CFQ assume all requests
are from the same process and serve any pending requests
when they see them, even if they are in distant regions.
Consequently, they degenerate into work-conserving
schedulers such as Deadline. However, if we make the
process information available to AS and CFQ in the
experiments, they perform as well as SS (with a performance
difference of less than 3%), demonstrating the importance
of non-work-conserving scheduling.

Figure 4: Throughputs of par-read with varying thinktimes, the
time period between two continuous requests issued by a process.

Interestingly, the observations for random access can also
be made on the other two benchmarks, which issue sequential
requests that trigger prefetching in the operating system
and allow the scheduler to see asynchronously issued
requests. The condition for a work-conserving scheduler to
keep serving one process's requests is the elimination of
quiet periods in the process's I/O service, that is, time
periods during which the scheduler sees no requests from
the process after the last time it attempted to dispatch a
request of this process. However, prefetching does
not eliminate quiet periods in the system for two reasons. 
First, Linux maintains two readahead windows to prefetch 
file data. Prefetch requests issued for one window are con- 
tiguous and sent to the scheduler together. The scheduler 
has a good chance to merge them into one request. Con- 
sequently, the next prefetch request would not be triggered 
and sent to the scheduler until this request is completed and 
its data is consumed by the process. Second, as today’s 
hard disks store multiple pending requests, a scheduling 
decision may have to be made before the process’s request 
is completed. At this moment, it is likely the process’s next 
prefetch request has not been generated, creating a quiet 
period. In both cases, Deadline, as well as AS and CFQ
when process information is unavailable, would schedule
other processes’ requests and thrash the disk head among
processes. While increasing the prefetch window can
reduce the number of quiet periods, they are unlikely to
be fully removed. Although SS does not rely on process
information, its performance advantage is impressive, with
about 3.2X throughput improvement over the other
schedulers. If we increase the thinktime, the performance
improvement of SS becomes increasingly small as its wait
times become larger (shown in Figure 4). When the thinktime
is as large as 200us, the corresponding quiet periods
increase to about 8.5ms, which causes streams to break and
accordingly causes SS to stop waiting for future requests
and behave like Deadline.
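
The choice between sitting out a quiet period and dispatching a distant request can be sketched as follows. This is only an illustration of non-work-conserving behaviour in general, not the SS implementation: the function name `decide`, the `near_span` locality window, and the `max_wait_ms` bound are all assumptions introduced for the example.

```python
def decide(last_lbn, pending, quiet_ms, near_span=1024, max_wait_ms=8.0):
    """Return ("dispatch", lbn) or ("wait", None).

    last_lbn    -- location of the most recently served request
    pending     -- LBNs currently queued at the scheduler
    quiet_ms    -- time elapsed without a nearby request arriving
    near_span   -- assumed distance that still counts as "nearby"
    max_wait_ms -- assumed bound on how long the disk may sit idle
    """
    def distance(lbn):
        return abs(lbn - last_lbn)

    nearby = [lbn for lbn in pending if distance(lbn) <= near_span]
    if nearby:
        # A nearby request exists: serve it and keep the head in place.
        return ("dispatch", min(nearby, key=distance))
    if pending and quiet_ms >= max_wait_ms:
        # The quiet period lasted too long: fall back to
        # work-conserving behaviour and serve the closest request.
        return ("dispatch", min(pending, key=distance))
    # Otherwise keep the disk idle in the hope of a nearby request.
    return ("wait", None)


print(decide(1000, [1008, 900000], 0.0))  # nearby request is served
print(decide(1000, [900000], 2.0))        # short quiet period: wait
print(decide(1000, [900000], 8.5))        # long quiet period: give up
```

Under these assumed parameters, a quiet period of 8.5ms exceeds the wait bound, so the sketch dispatches the distant request, which mirrors the fallback to Deadline-like behaviour described above.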


3.4 Storage with Inadequate Process Information


Figure 5: Performance of benchmarks multi-threads, PVFS,
ProFTPD, and TPC-H with different disk schedulers. ProFTPD
and TPC-H run either on one virtual machine or on two virtual
machines. The performance is presented as the schedulers’
percentage improvement over that of Noop. For multi-threads,
TPC-H(1VM), and TPC-H(2VM) the performance is measured with
execution times, which are 65.7s, 231.4s, and 332.0s,
respectively, with Noop. The performance of PVFS, ProFTPD(1VM),
and ProFTPD(2VM) is measured with throughputs, which are
132.0MB/s, 17.1MB/s, and 12.5MB/s, respectively, with Noop.

Next we consider four benchmarks running in an
environment where the process information is inadequate or
misleading. To investigate how synchronization of
I/O-intensive threads affects behaviors of disk schedulers, we
wrote a microbenchmark called multi-threads, in which
there are four processes, each forking two threads. Each
thread reads a 40MB file in a strided pattern, reading the
first 4KB of data of every 16KB segment from the beginning
to the end of the file. The distance between the two
files accessed by one process is 100MB, and the distance
between files read by adjacent processes is 50GB. The two
threads of a process synchronize after each issues five
requests.
The performance improvements of the schedulers for the
benchmark over that of Noop are presented in Figure 5.
We can see that SS more than doubles the performance of
Deadline in terms of reduction of execution time.
Unfortunately, AS and CFQ deliver performance even worse than
that of Noop. The reason is that the synchronization
disrupts their non-work-conserving scheduling, which is
unnecessarily tied to the process. For example, assuming that
the two threads of a process are T_A and T_B, AS keeps
serving requests from T_A by anticipatory wait until T_A
reaches a synchronization point. Then AS has to wait for about
4ms until its timer expires before it starts to serve T_B’s
requests, even though one of T_B’s requests is pending nearby.
In Linux a thread is presented as a light-weight process.
Because the nearby pending request belongs to another
thread, AS does not immediately dispatch it. Instead it
suffers a long and unfruitful wait. In comparison, without
relying on the process information SS is not constrained by
the synchronization and dispatches any nearby requests.
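
The strided access pattern of the multi-threads microbenchmark can be reproduced with a few lines. This is a sketch of the offsets only, assuming binary units (1KB = 1024 bytes); the function name `strided_reads` is hypothetical, and thread synchronization is not modelled.

```python
KB = 1024
MB = 1024 * KB


def strided_reads(file_size=40 * MB, segment=16 * KB, read_size=4 * KB):
    """Yield (offset, length): the first read_size bytes of every segment."""
    for offset in range(0, file_size, segment):
        yield offset, read_size


requests = list(strided_reads())
total_bytes = sum(length for _, length in requests)

print(len(requests))       # requests issued per thread
print(requests[:3])        # first few (offset, length) pairs
print(total_bytes // MB)   # MB actually read out of the 40MB file
```

Each request is 12KB away from the end of the previous one, so the pattern is sequential in direction but not contiguous.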

PVFS is a parallel file system widely used in
high-performance computing clusters [9]. We run the
mpi-io-test program, an MPI-IO benchmark from the PVFS2
software package [30], on PVFS 2.8.2. The cluster has
four compute nodes and eight data servers, where files are
striped with a 64KB striping unit. Each data server has
a SATA disk (Seagate Barracuda 7200.10) with NCQ
enabled. We run four such programs, each reading a distinct
file with 10GB space in between. Each program has eight
MPI processes, two per compute node, to read or write one
10GB file. The processes take turns reading 64KB blocks
of data sequentially from a 1GB file. On a particular data
server, requests from the same program have strong
locality. SS exploits this locality and improves the
aggregate throughput of all four MPI programs by 87% over
Deadline or Noop, whereas AS and CFQ seriously
underperform (Figure 5). On each PVFS server there
is a daemon called pvfs2-server accepting requests from
compute nodes. To achieve asynchrony in its service, the
daemon maintains a pool of threads and uses any available
thread to dispatch its requests to the kernel. Consequently,
AS or CFQ see requests associated with essentially
randomly assigned thread numbers and can hardly recognize
the locality within requests from the same thread, which
leads to disk head thrashing among blocks of different files.

Xen is a virtual machine monitor that allows multiple 
guest virtual machines (VMs) to run on it [3]. In Xen, 
guest VMs send requests to their respective virtual block 
devices, which use the blktap mechanism to pass the
requests to the kernel driver in the host VM, a privileged
virtual machine that does the actual dispatch of I/O requests to
disk. In the experiment we run two benchmarks, ProFTPD 
1.3.1 and TPC-H, on Xen 4.0.1-rc6 to evaluate the disk 
scheduler in the host VM while leaving the schedulers in 
the guest VMs as Noop to quickly release requests into the 
host VM. ProFTPD is an FTP server [28]. In the test, we 
run a ProFTPD instance on each guest VM to serve four 
clients simultaneously downloading four 300MB files, re- 
spectively. There are 20GB space gaps between the files. 
For TPC-H, we use the same experimental setting for each 
guest VM as described in Section 3.3. From the exper- 
imental results shown in Figure 5 we see that SS signif- 
icantly improves throughput, while AS and CFQ exhibit 
only limited, if any, improvements over Deadline and Noop 
because of their lack of process information about requests 
issued by processes on the same guest VM. When we run
two guest VMs, each with the same setting as in the one-VM
scenario, AS and CFQ produce higher throughput
improvements as they can differentiate requests from
different guest VMs and thus reduce long-distance seeks among
data requested by different VMs. Accordingly, the relative
performance advantage of SS is reduced.


3.5 Storage with Disk Array 


Figure 6: Performance of benchmarks par-read, TPC-H, and
PostMark with different disk schedulers in a 5-disk array. The
performance is presented as the schedulers’ percentage
improvement over that of Noop. Performance of TPC-H is measured
with execution time, which is 104.6s with Noop. For par-read and
PostMark, it is measured with throughputs, which are 168.0MB/s
and 1.3MB/s, respectively, with Noop.

To evaluate the performance impact of disk schedulers on
the disk array, we select three benchmarks: par-read,
TPC-H, and PostMark, whose settings are the same as described
in Section 3.3, except that all files are striped over five disks
with a 64KB striping unit. The disk array is organized
as RAID0. We have also experimented with RAID5 and
obtained consistent results. To focus on the performance
challenges imposed by data striping on the disk array, we 
do not hide process information in the test. The
experimental results are presented in Figure 6, which shows
that for benchmarks with sequential access patterns, such as
par-read and TPC-H, SS achieves impressive improvements of
114% and 174% over Noop, respectively. Without
opportunistic synchronization of the disks, the improvements
made by AS or CFQ are limited. For example, AS reduces
the execution time of TPC-H by only 25%, while it can
reduce the time by 72% over that of Deadline when only one
disk is used (see the measurement in Figure 3 for SS,
which produces about the same execution time as AS with
known process information). The throughput of par-read
with SS (361MB/s) approaches the peak throughput of the 
RAID card (around 400MB/s). The sequential access pat- 
tern with the help of aggressive prefetching in the RAID 
is turned into streams on each physical disk in SS, which 
helps eliminate disk thrashing. However, with the random 
access pattern of PostMark, SS shows minimal improve- 
ment as physical streams can hardly be formed. 
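
To see why striping spreads one sequential stream over all disks, the following sketch maps a logical byte offset to a disk under a conventional round-robin RAID0 layout with the 64KB striping unit and five disks used above. The mapping function `locate` is an assumption for illustration; the actual layout used by the RAID card is not described here.

```python
STRIPE_UNIT = 64 * 1024   # 64KB striping unit, as in the experiment
NUM_DISKS = 5             # five-disk array, as in the experiment


def locate(offset, stripe_unit=STRIPE_UNIT, num_disks=NUM_DISKS):
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    unit = offset // stripe_unit                 # which striping unit
    disk = unit % num_disks                      # round-robin placement
    disk_offset = (unit // num_disks) * stripe_unit + offset % stripe_unit
    return disk, disk_offset


# A 320KB sequential read touches every disk exactly once.
touched = [locate(off)[0] for off in range(0, 5 * STRIPE_UNIT, STRIPE_UNIT)]
print(touched)
```

Because every disk holds only every fifth unit of a file, a request stream that is sequential at the array level reaches each physical disk as a sequence of separate 64KB requests.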


3.6 Impact of Stream Scheduling on 
Throughput and Response Time 


Figure 7: Impact of stream scheduling on throughput improvement
and variation of request response time. (a) Throughputs
with varying stream time slices for benchmark par-read of
different thinktimes. (b) Request wait times for par-read of 0
thinktime with SS (default time slice of 124ms) and Deadline.
(c) Request service times for par-read of 0 thinktime with SS
(default time slice of 124ms) and Deadline.

SS achieves its performance advantage mostly through its
dedication of disk service to one stream of requests
during a certain period of time (stream_time_slice). By doing
so, potentially long-distance disk seeks take place only
between time slices. Therefore, increasing the time slice is
expected to reduce long-distance seeks and thus improve
I/O throughput. However, requests that are pending but do
not belong to the currently served stream may experience 
a longer pending period with increased time slice, which 
can increase variation in response time. To study the ef- 
fect of the time slice on throughput and response time, 
we run benchmark par-read with an experimental setting 
the same as described in Section 3.3. As shown in Fig- 
ure 7(a), the throughput improves with the increasing time 
slice. The more I/O intensive (with a smaller thinktime) 
the program is, the larger the improvement. The through- 
put improves quickly with I/O-intensive programs before 
the time slice reaches 100ms. After that, further increas- 
ing the slice yields only diminishing returns. This is why 
SS uses the default time slice of 124ms, the same value as 
adopted by Linux’s AS. With this time slice, we measure 
two components of every request’s response time, namely 
wait time and service time, during the execution of par- 
read with zero thinktime, and show them for the first four 
seconds of execution with SS and Deadline in Figure 7(b) 
and Figure 7(c), respectively. Unsurprisingly, SS produces 
some substantially large wait times (as large as 0.37s), as 
it rotates its service among four streams with a 124ms 
slice. Considering that Deadline’s default timeout period 
for boosting request priority is 0.5s, these wait times are 
deemed acceptable. Meanwhile, as each cycle of such ro- 
tation produces only a few long wait times for synchronous 
requests, the percentage of requests with long wait times 
is very small and most requests have significantly reduced 
wait times with SS (Figure 7(b)). Furthermore, the use of
a modest time slice in SS, which increases variation of
response time, pays off with significantly reduced request
service time (Figure 7(c)) and improved disk efficiency.
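
The largest wait times reported above follow directly from the rotation among streams. As a rough check, assuming strict round-robin service in which every other stream uses its full slice (seek and service times are ignored, and the function name is hypothetical):

```python
def worst_case_wait_ms(num_streams, time_slice_ms):
    """Wait of a request that arrives just as its stream's slice ends."""
    return (num_streams - 1) * time_slice_ms


wait_ms = worst_case_wait_ms(num_streams=4, time_slice_ms=124)
print(wait_ms)            # 372
print(wait_ms / 1000.0)   # 0.372
```

With four streams and a 124ms slice this gives 372ms, in line with the observed wait times of up to 0.37s and below Deadline’s 0.5s timeout.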


4 Related Work 


The effectiveness of disk scheduling is highly dependent
on the existence of request locality. For this reason, there
are many efforts to improve disk access locality. In the
high-performance computing field many optimizations are 
made in the middleware to transform a large number of 
small non-contiguous requests into a smaller number of 
larger contiguous requests, including Data sieving [34], 
Datatype I/O [6], and Collective I/O [34, 43]. Because lo- 
cality is about requested data locations on disk, there are 
many efforts to rearrange on-disk data layout to improve 
spatial locality, including data relocation [15] or data repli- 
cation, either within one disk [14, 4, 20] or across mul- 
tiple disks [42]. In addition, compiler techniques can be 
employed to improve locality by forming preferable I/O 
access patterns for the disks as well as optimizing file 
layouts matching known access patterns [18, 21]. How- 
ever, the enhanced locality can be weakened or even lost 
when there are multiple processes, each concurrently issu- 
ing synchronous I/O requests. The locality can be recov- 
ered by non-work-conserving disk schedulers, such as the 
Anticipatory Scheduler [16]. Anticipatory scheduling has
been implemented in some popular Linux disk schedulers,
including Anticipatory [24] and CFQ [1].

The problem with the assumption by existing non-work- 
conserving schedulers on the availability of process infor- 
mation has been recognized in the literature, but effective 
solutions have not yet been proposed. One scenario is that 
the disk scheduler in the virtual machine monitor, such as 
AS, does not know from which specific process running on 
a guest virtual machine a request is issued. The Antfarm 
facility can help infer process information for disk schedul- 
ing by tracking activities of OS processes [17]. However, 
application of the technique is limited to the virtual
machine environment. In addition, effort must be expended to
implement the facility for each individual virtual machine 
system and the system must be open for instrumentation 
and patching. The difficulty caused by the lack of pro- 
cess information has also been found with the AS sched- 
uler deployed in the NFS server [11], where the proposed 
approach is to use other access context information, such
as accessed files’ directory or owner, as hints to group
requests for scheduling. While this approach can make up for
the inadequacy to some extent, the hints may not always be
relevant in revealing on-disk locality to the scheduler and
could be misleading. A study of the Linux disk schedulers
found that AS or CFQ can underperform significantly even
when process information is available but multiple
processes cooperatively send synchronous requests, because
AS or CFQ may fail to find anticipation opportunity when
it attempts to attribute history access statistics to
individual processes [36]. By identifying access streams for
non-work-conserving scheduling directly from the access
locations, SS discards the requirement for process information
instead of looking for its possibly inadequate substitutes
with additional overhead in the OS or file systems.

An I/O stream, or request sequence, has been used before
to analyze and exploit access locality. Regarding
I/O prefetching, though many sophisticated designs
have been proposed, such as those based on a probability
graph model [38], the information-theoretic Lempel-Ziv
algorithm [7], or a time series model [37], the stream-based
approach dominates the design of prefetching in the system
and has proven its effectiveness and efficiency [27, 41, 35].
Streams are also formed on the hard disk addresses to track
disk access history and enable on-disk prefetching [12].
Another interesting work is a tool called C-Miner that uses
a data mining technique to find streams of disk block
access representing repeatable block sequences, which can
be used for initiating reliable prefetching [22]. While SS
also tries to form streams among requests to the disk, the
streams serve a different purpose. For prefetching, a
well-established stream will lead to prefetching of multiple data
blocks ahead of the stream, while for SS the stream is
maintained to determine whether the disk should wait for an
upcoming request. More importantly, the cost of using
streams in the aforementioned works can be much higher
than that for SS when stream members have to be
remembered for evaluation of stream quality, while SS needs only
to track the latest member of a stream.

Regarding scheduling in the disk array, the necessity of
coordinating requests has been widely recognized,
especially for those with small striping units. When multiple
disks are involved to serve a request, “disks take
different amounts of time to position, the request must wait for
the slowest-positioning disk to transfer its data” [10]. A
possible solution is a synchronized interleaved disk
system that synchronizes disk spindles and serves one request
at a time in a disk array [19, 8]. However, for a striping
unit size larger than one byte or for a number of disks in
a disk system beyond a certain limit, a fully synchronized
disk array could seriously hurt performance by limiting the
number of concurrently served requests [31]. The
interference among requests from different processes caused
by uncoordinated disk access has been reported and
addressed in the cluster-based storage environment by using a
timeslice-based co-scheduling method [40]. Though their 
work is similar to ours in the coordination of some or all 
disks and dedication of them to one process at a time, it 
cannot be effectively used as a disk scheduler to exploit 
spatial locality for higher performance. One reason is that 
their work requires an offline-calculated scheduling plan 
according to QoS specifications that does not adapt to the 
workload dynamics. Another reason is that it does not eval- 
uate the benefits of dedicated service to a process relative 
to the cost of disk synchronization, and indiscriminately 
applies the synchronization to all programs. In contrast, SS 
dynamically evaluates the cost effectiveness of non-work- 
conserving scheduling by tracking and validating streams 
and opportunistically allows the disks to serve one virtual 
stream at a time. A scheme using opportunistic synchro- 
nization to reduce I/O interference among multiple MPI 
programs accessing a cluster of data servers has been pro- 
posed [44]. Without identifying streams, the scheme must 
assume a file is accessed by only one program and the MPI 
library and parallel file system must be instrumented to in- 
fer the assumed relationship and make it available to the 
scheduler. In contrast, SS provides a more general solution 
not constrained by availability of process information. 


5 Conclusions 


We have described the design and implementation of 
a stream scheduling framework that turns any work- 
conserving disk scheduler into a non-work-conserving one, 
even without process information available, to exploit lo- 
cality embedded in the sequences of synchronous requests. 
The framework can also opportunistically coordinate the 
services at different disks of a disk array to recover and 
exploit the locality weakened by file striping. The frame- 
work has been prototyped in the Linux kernel, both as a 
disk scheduler and as a software RAID scheduler. Extensive
experiments have demonstrated that SS can significantly
improve the performance of representative benchmarks
such as TPC-H, PostMark, grep, and FTP, as well as
MPI programs. In particular, SS shows its unique value
in environments where process information is unavailable, 
such as block or file storage servers and virtual machines. 
Abstract


This paper introduces proximal I/O, a new technique for 
improving random disk I/O performance in file systems. 
The key enabling technology for proximal I/O is the abil- 
ity of disk drives to retire multiple I/Os, spread across 
dozens of tracks, in a single revolution. Compared to 
traditional update-in-place or write-anywhere file sys- 
tems, this technique can provide a nearly seven-fold im- 
provement in random I/O performance while maintain- 
ing (near) sequential on-disk layout. This paper quan- 
tifies proximal I/O performance and proposes a simple 
data layout engine that uses a flash memory-based write 
cache to aggregate random updates until they have sufficient
density to exploit proximal I/O. The results show
that with a cache of just 1% of the overall disk-based
storage capacity, it is possible to service 5.3 user I/O
requests per revolution for a random-update workload. On an aged
file system, the layout can sustain serial read bandwidth 
within 3% of the best case. Despite using flash memory, 
the overall system cost is just one third of that of a sys- 
tem with the requisite number of spindles to achieve the 
equivalent number of random I/O operations. 


1 Introduction 


This paper focuses on an important but neglected as- 
pect of file system performance: workloads that mix ran- 
dom writes with sequential reads to the same data. In 
particular, serial reads after random writes (SRARW) 
are common in many applications that are large con- 
sumers of storage in enterprise environments. For exam- 
ple, database systems typically acquire and update data 
through online transactional processing (OLTP), which 
is dominated by small writes, and subsequently read it in 
bulk for other tasks, such as analysis or backup. SRARW 
workloads are particularly problematic in large-scale de- 
ployments, which are often spindle-limited and too large 
to be moved to flash-based SSDs cost effectively. 


Existing file system designs optimize either the serial 
read access or the random writes in a SRARW work- 
load, but do so at the expense of the other operation type. 
At one end of the spectrum, write-anywhere file systems 
such as the Sprite log-structured file system (LFS) [27] 
and its descendants [19, 22, 3] are write optimized. They 
batch small or random writes into larger sequential
allocations on disk, transforming updates of logically
unrelated data into physically sequential I/O. However, as
they age, their data layout becomes fragmented, scattering
related data across the disk. Thus, logically sequential
access such as a database table scan leads to inefficient disk
I/O. We have measured access to physically fragmented 
data at as little as 3% of the best-case serial read band- 
width. (See results in Section 5.3.) 


At the other end of the spectrum, update-in-place file 
systems, such as FFS [21] and related designs [5, 32] 
are optimized for serial read and write access. These 
file systems attempt to allocate logically sequential data 
to physically sequential disk locations, providing good 
bandwidth for serial data access. However, this trans- 
lates into inefficient non-sequential I/O, as destination 
blocks are predetermined by past allocation decisions, 
which are unlikely to be optimal in the face of random 
updates. Moreover, when such systems keep older ver- 
sions of the data, they must perform a variant of copy- 
on-write [25], doubling the amount of inefficient random 
write I/O. Database systems often decouple this ineffi- 
cient back-end I/O from foreground processing through 
the use of logging. The log then becomes a staging area 
for asynchronous bulk updates to the database tables. 
However, this technique alone does not mitigate the high 
cost of random I/Os to the back-end of a storage system 
that has limited I/O capacity. 


To increase the back-end’s effective I/O capacity with- 
out increasing disk (spindle) count, we introduce a new 
type of disk access pattern that we term proximal I/O. 
We demonstrate how proximal I/O leverages features of 
current disk drives to retire in a single revolution several 
I/Os scattered across dozens of tracks holding hundreds 
of thousands of sectors (Section 2). We propose a new 
data layout (Section 3), which shares the desired prop- 
erties of existing copy-on-write, write-anywhere file sys- 
tems that make random user writes efficient and allow for 
snapshots with minimal I/O overhead. 


Using a prototype implementation of our data layout
(detailed in Section 4), we show that with a write cache sized
at only 1% of the overall storage capacity, we can
service 5.3 I/Os per revolution for workloads with random
updates, all the while maintaining a data layout on a
heavily aged system that can deliver 97% of the bandwidth
achieved with the best-case scenario of physically
sequential layout (Section 5). The 5.3 I/Os serviced per
revolution represent a 7x improvement over the cost of a
random disk I/O in traditional frame arrays with
update-in-place data layout. With RAID5 and non-volatile write
caches, such arrays use four random disk I/Os at the back end
for each user-level update; with copy-on-write snapshots,
this number doubles. In contrast, our approach uses about
one disk I/O at the back end for every user-level random
update when both RAID and copy-on-write are employed.


The primary contribution of this paper is the introduction 
of a data layout strategy that combines a small amount 
of flash memory with proximal I/O to efficiently service 
random updates to sequentially-allocated on-disk data 
without undermining on-disk locality. We provide effi- 
ciency both in performance and cost, significantly im- 
proving the performance of random writes at less than a 
third of the cost per IOPS of a pure disk solution. Finally,
we present the first study to quantify the behavior
of modern disk drives under the proximal I/O access pattern.


2 Proximal I/O 


Proximal I/O leverages the ability of modern disk drives
to execute multiple I/Os per single revolution scattered
across dozens of disk tracks. Given the 300-400 Gb/in²
areal density of the magnetic media in currently shipping
disk drives, this translates to a range of hundreds of
thousands of logical blocks (LBNs)¹. We describe the
interplay between seek-time profile and request scheduling
that makes proximal I/O possible.


In the material presented here, we focus primarily on one 
disk model (the Seagate Cheetah 15K.5, introduced to 
the market in late 2007). However, both the 15,000 RPM 
enterprise-class drives as well as the 7,200 RPM high- 
capacity nearline drives, colloquially referred to by their 
interfaces as respectively SCSI/FC and SATA drives, are 
capable of proximal I/O. Our measurements of over 20 
different models of both drive types, representing several 
generations of the same family of drives and manufac- 
tured by four different vendors confirm the observations 
about proximal I/O described here. 


2.1 Relevant technologies 


Historically the seek profile, the plot of seek time as a 
function of radial distance, has been described by a con- 
tinuous curve with two components: for small distances, 
one that is a square root of the cylinder distance and, 


¹The number of sectors per track (SPT) for recent 3.5” disk drives
ranges between 800 and 2800. edge. The 15,000 RPM enterprise-class 
disks employing 65 mm platters have fewer SPT at their outer-most 
track compared to the 95 mm platters in the 7,200 RPM disk drives [2]. 
for larger seek distances, one where seek time is a linear
function of cylinder distance [28]. As observed by
Schlosser et al. [31], the seek profile of more recent disks 
includes a third component: for very small seek distances 
of less than C cylinders, the seek time is nearly constant 
and is effectively equivalent to the track-switch time, or 
the time needed for the head to settle on a track. 


The surface-serpentine disk layout adopted by more re- 
cent drives [1, 31] uses minimal seek time over a range of 
tracks. After mapping the last LBN to a given track, the 
disk firmware maps the next LBN to the adjacent track 
on the same surface rather than to the same track on a 
different surface. After mapping across C consecutive 
tracks, the next logical LBN is mapped to a sector on 
a track of a different surface C cylinders away. Thus, 
when accessing a sequential run of LBNs, the disk heads
will occasionally seek across the C cylinders to access
the next logically sequential LBN. Using a disk extraction
tool [29], we determined C = 65, which covers a
range of 624,000 LBNs (1200 SPT × 65 cylinders × 8
surfaces) for the 300 GB Cheetah 15K.5 disk.
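
The LBN range quoted above is the product of the three geometry parameters. The short check below reproduces it; the 512-byte sector size used to convert the range to bytes is an assumption, not a figure from the text.

```python
def lbn_span(sectors_per_track, cylinders, surfaces):
    """Number of LBNs reachable within one serpentine group of cylinders."""
    return sectors_per_track * cylinders * surfaces


span = lbn_span(sectors_per_track=1200, cylinders=65, surfaces=8)
print(span)                              # 624000 LBNs

SECTOR_BYTES = 512                       # assumed sector size
print(span * SECTOR_BYTES / 2**20)       # span in MiB under that assumption
```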


The Shortest-Positioning-Time-First (SPTF) request 
scheduler [34] implemented in the disk firmware can ef- 
fectively increase the throughput of serviced requests. It 
reorders requests to minimize the total positioning time
(i.e., the sum of seek time and rotational latency) for each
I/O request in the queue. With a sufficiently large number
of outstanding requests, it can lower the total positioning
cost and service many more requests per unit of time [1].
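
The reordering described above can be illustrated with a greedy sketch that always serves the request with the smallest positioning time next. This is not the drive firmware's algorithm: the seek profile in `seek_ms`, the representation of a request as (cylinder, angle as a fraction of a revolution), and the omission of transfer time are all simplifying assumptions made for the example.

```python
import math

REV_MS = 4.0  # one revolution at 15,000 RPM


def seek_ms(distance):
    """Hypothetical seek profile: constant for short seeks, sqrt beyond."""
    if distance == 0:
        return 0.0
    if distance <= 65:
        return 0.5
    return 0.5 + 0.05 * math.sqrt(distance - 65)


def positioning_ms(head, req):
    """Seek time plus the rotational wait once the seek has completed."""
    (head_cyl, head_angle), (cyl, angle) = head, req
    seek = seek_ms(abs(cyl - head_cyl))
    arrival = (head_angle + seek / REV_MS) % 1.0   # angle after the seek
    return seek + ((angle - arrival) % 1.0) * REV_MS


def total_ms(head, order):
    """Total positioning time when requests are served in the given order."""
    total = 0.0
    for req in order:
        total += positioning_ms(head, req)
        head = req                                  # transfer time ignored
    return total


def sptf_order(head, requests):
    """Greedy SPTF: repeatedly pick the cheapest request to reach next."""
    pending, order = list(requests), []
    while pending:
        best = min(pending, key=lambda r: positioning_ms(head, r))
        pending.remove(best)
        order.append(best)
        head = best
    return order


queue = [(0, 0.75), (0, 0.25), (0, 0.5)]            # same track, three angles
start = (0, 0.0)
print(total_ms(start, queue))                       # arrival order
print(total_ms(start, sptf_order(start, queue)))    # SPTF order
```

In this toy case the three requests are served within a single revolution under SPTF, while serving them in arrival order takes twice as long. Note that the greedy choice minimizes the next positioning step, not necessarily the total over the whole queue.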


2.2 Expressing request service time 


Issuing only one request at a time to the disk negates the 
benefits of the SPTF scheduler. The service time of a 
small random request will then be equivalent to the sum 
of average seek (equivalent to 1/3 of the full-strobe seek) 
and rotational delay of 1/2 a revolution. For the Cheetah
15K.5 disk, these are respectively 3.6 ms and 2 ms,
resulting in a service time of 5.6 ms. For a 7,200 RPM
Western Digital RE3 nearline disk, the values are respectively
6.9 ms and 4.2 ms, yielding a service time of 11.1 ms.


As these times for the two drives vary (by design) by a 
factor of 2, we will use instead a relative measure of 
service time, here called OP, that lets us ignore the dif- 
ferences between disk types and their generations. Thus, 
1 OP is the service time for a small random disk request 
or the measure of resources consumed when servicing a 
random disk I/O. 


Our enterprise-class disk has the capacity to service 
about 180 I/Os per second, while the nearline disk only 
90. These drives demonstrate a useful rule of thumb; 
the average seek time of a disk is roughly equal to the
time for a full rotation. Thus, the seek component of an
OP is roughly 2/3 OP while the remaining portion is
attributed to the rotational latency of half a revolution. This
rule predicts that a disk can typically service 0.66 random
requests per revolution. Our two drives do slightly
better: 0.71 for the enterprise-class disk and 0.75 for the
nearline disk. This trend has held across many disk
generations with different rotational speeds and seek times.
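
The figures in this section can be rederived from the average seek times and rotational speeds given above. The sketch below does so; the function names are hypothetical, and the only inputs are the numbers quoted in the text.

```python
def revolution_ms(rpm):
    """Time of one full revolution in milliseconds."""
    return 60_000.0 / rpm


def op_ms(avg_seek_ms, rpm):
    """One OP: average seek plus half a revolution of rotational latency."""
    return avg_seek_ms + revolution_ms(rpm) / 2


def requests_per_revolution(avg_seek_ms, rpm):
    """Random requests serviced per revolution, one request at a time."""
    return revolution_ms(rpm) / op_ms(avg_seek_ms, rpm)


for name, seek, rpm in [("Cheetah 15K.5", 3.6, 15000), ("WD RE3", 6.9, 7200)]:
    t = op_ms(seek, rpm)
    print(name, round(t, 1), "ms/OP,", round(1000 / t), "IOPS,",
          round(requests_per_revolution(seek, rpm), 2), "requests/rev")

# Rule of thumb: seek = one revolution, latency = half a revolution.
print(round(1 / 1.5, 2))

# 5.3 I/Os per revolution relative to the enterprise-class disk's baseline.
print(round(5.3 / requests_per_revolution(3.6, 15000), 1))
```

The computed service rates are close to the roughly 180 and 90 I/Os per second stated above, and the last line is consistent with the roughly seven-fold improvement claimed for 5.3 I/Os per revolution.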


2.3 Measurements 


To quantify the benefits of proximal I/O, we measured 
the per-request service time in a batch of requests, here 
called a strand, under a variety of conditions. We chose 
at random a location on the disk and controlled the span 
of LBNs covered by the requests as well as the number 
of requests in the strand issued simultaneously (i.e., the 
disk queue depth). The measured response time for the 
entire strand, listed in Table 1, is the sum of the service 
times of the individual requests in the strand. 
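
A strand response time converts to a per-request cost in OPs by dividing by the number of requests and by the single-request service time. The sketch below applies this to the 8-request READ entries of Table 1, using the 5.6 ms single READ service time as 1 OP; the function name is hypothetical.

```python
SINGLE_READ_MS = 5.6   # mean service time of one random READ (1 OP)

# Strand response times (ms) for 8 requests per strand, from Table 1.
strand_ms_8 = {
    "semi-sequential": 9.9,
    "track-approximate": 17.0,
    "proximal read": 17.1,
}


def per_request_op(strand_ms, requests, op_ms=SINGLE_READ_MS):
    """Effective per-request cost of a strand, in units of OP."""
    return strand_ms / requests / op_ms


for mode, ms in strand_ms_8.items():
    print(mode, round(per_request_op(ms, 8), 2))
```

The results, about 0.38 OP for proximal and track-approximate access and about 0.22 OP for semi-sequential access, are close to the per-request costs read from Figure 1.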


To service a strand of requests, the disk must first seek 
to the general vicinity of the requests. Servicing the first 
request in the strand thus incurs the cost of an average 
seek in addition to some rotational latency. However, if 
the requests are near each other, servicing the remaining
requests incurs only some additional rotational latency
and potentially a minimal seek/track switch, since all
requests in the strand are within C cylinders of each other.
As we batch all requests, the disk is free to reorder them. 


Figure 1 shows the effective per-request service time as
the number of requests in the strand (and hence the queue 
depth) increases, expressed both absolutely and in rela- 
tive units of OPs. The graph compares three different 
access modes. The track-approximate access limits the
LBN span to 1024, the approximate size of the disk’s
track, rounded down to a power of two. The proximal
access uses a span of 100,000 LBNs. The semi-
sequential access represents the best possible disk ac- 
cess after sequential streaming [30] — the requests are 
carefully chosen such that each request is positioned at 
a different track and at an offset equivalent to the mini- 
mal seek/track switch time. For semi-sequential access, 
we need to know detailed disk drive parameters. On the 
other hand, proximal I/O does not require the knowledge 
of track switch time or precise track size (SPT). 


We remark on the following trends. First, as the number
of requests in the strand increases, the effective
per-request service time decreases from 1 OP to 0.39 OP for
a strand of 8 requests, a 2.5x improvement over the
case with one request per strand. Second, both
track-approximate and proximal mode are very similar, despite
the ten-fold difference in the LBN span. And third, the
semi-sequential mode yields 0.23 OP for 8 requests per
strand compared to 0.39 OP of proximal mode, an
additional 1.7 times improvement. The results for WRITEs
(not shown here) exhibit a similar trend; the slightly
higher strand response time (see Table 1) is due to
additional write-settle time for track-to-track seeks.


                       Strand response time (ms)
Requests per strand       2       4       6       8
Semi-sequential         5.9       1     8.6     9.9
Track-approximate       7.4    10.8    13.9    17.0
Proximal READ            TA    11.2    14.2    17.1
Proximal WRITE          8.6    12.9    16.8    20.4

Table 1: Comparison of strand response times for Seagate
Cheetah 15K.5. The mean service time of a single READ request
is 5.6 ms. For WRITE, it is 5.8 ms due to extra write-settle time.


Figure 1: Per-request cost of small requests in a strand.
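
The OP values quoted in this section can be cross-checked against the 8-request column of Table 1. The sketch below divides each strand response time by the number of requests and by the 5.6 ms single-READ service time from the table caption; the function and variable names are illustrative only.

```python
READ_OP_MS = 5.6  # mean service time of one random READ (Table 1 caption)

# Strand response times (ms) for strands of 8 requests, from Table 1.
STRAND_MS_8 = {
    "semi-sequential": 9.9,
    "track-approximate": 17.0,
    "proximal READ": 17.1,
}


def per_request_op(strand_ms: float, n_requests: int,
                   op_ms: float = READ_OP_MS) -> float:
    """Effective per-request cost, in OPs, for a strand of n_requests."""
    return strand_ms / n_requests / op_ms


for mode, ms in STRAND_MS_8.items():
    print(f"{mode:18s} {per_request_op(ms, 8):.2f} OP")
```

The computed values land within about 0.01 OP of the 0.39 OP and 0.23 OP quoted in the text; the small gap is presumably due to rounding in the published table.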


2.4 Detailed model comparisons 


Two hypotheses might explain why we do not see values 
for proximal I/O that are closer to the semi-sequential 
mode. First, with randomly chosen blocks, some of them 
may land on the same track and the disk firmware opts 
to prefetch the remainder of the track before servicing 
other requests. Second, even without triggering prefetch- 
ing, the random placement of the requests can cause extra 
(missed) revolutions as we describe below. 


The semi-sequential access carefully chooses the place- 
ment of blocks so as to eliminate any rotational delays 
between requests after a track switch. With randomly
chosen requests in proximal access, requests on different
tracks can have a rotational offset that is smaller than the
time needed to switch tracks. The following paragraphs
help illustrate how the disk scheduler minimizes overall 
rotational latency. They also show that it is the stochastic 
nature of the request placement rather than an artifact of 
the disk firmware causing the extra revolutions. 


Figure 2: Per-request service time for a strand of 8 requests
with a varying number of READ requests outstanding at the drive
(proximal I/O with 8 requests, Cheetah 15K.5 FC disk, 4 KB READs).
The first point is equivalent to a FIFO scheduler. The
percentages next to the data points represent improvement
relative to the FIFO data point. The second graph compares the
modeled and the observed (measured) behavior.


To verify our hypothesis, we set up an experiment, whose 
results are shown in Figure 2, where we fixed the number 
of requests per strand to 8 and varied the number of re- 
quests queued at the drive. With one request outstanding, 
the requests are serviced in FIFO order. As the number 
of outstanding requests increases, the disk scheduler can 
choose a request with smaller rotational latency, yielding 
a 32% reduction in per-request service time for a queue 
depth of 8 requests. This result confirms that proximal 
access effectively leverages the SPTF scheduler. 


To obtain the expected number of revolutions needed to 
service a strand of requests with proximal I/O, we de- 
veloped an analytical model for computing the expected 
strand response time and the probability distribution 
of missing revolution(s) for proximal I/O access. The 
model is based on the birthday paradox principle [33] 
and works at a high level as follows. It divides the disk 
into equally sized wedges or bins. When two requests 
(on different tracks) fall into the same bin, the disk heads 
cannot move fast enough to reach the second request in 
time and will have to service it during the next
revolution. Because the track switch time (0.4-0.8 ms) is
high relative to the revolution time, there are only a few
bins (days in a month) available and several requests are
likely to fall into the same bin (i.e., to have a birthday on
the same day). See Appendix A for model details.
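
The birthday-paradox intuition can be illustrated with a deliberately simplified sketch. It is not the Appendix A model: it assumes a 4 ms revolution (15,000 RPM), places requests uniformly and independently into bins, and assumes that requests sharing a bin are serviced on successive revolutions. All names and parameters are hypothetical.

```python
import random


def num_bins(revolution_us: int, track_switch_us: int) -> int:
    """Number of equally sized wedges: one per track-switch time."""
    return revolution_us // track_switch_us


def p_shared_bin(n_requests: int, bins: int) -> float:
    """Probability that at least two of n_requests fall into one bin."""
    if n_requests > bins:
        return 1.0
    p_all_distinct = 1.0
    for i in range(n_requests):
        p_all_distinct *= (bins - i) / bins
    return 1.0 - p_all_distinct


def mean_revolutions(n_requests: int, bins: int,
                     trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of revolutions needed per strand.

    Simplifying assumption: a strand needs as many revolutions as
    the number of requests in its fullest bin.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = [0] * bins
        for _ in range(n_requests):
            counts[rng.randrange(bins)] += 1
        total += max(counts)
    return total / trials


for switch_us in (400, 800):
    bins = num_bins(4000, switch_us)  # assumed 4 ms revolution
    print(bins, p_shared_bin(8, bins), mean_revolutions(8, bins))
```

Even with ten bins, a strand of eight requests almost always has at least one pair sharing a bin, which is consistent with the observation that extra revolutions stem from random placement rather than from firmware behavior.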


Figure 3 demonstrates the high accuracy of our model, 
comparing the measured and modeled distributions of 
the strand response times. The two curves labeled 
Strands and Model are very similar with nearly identi- 
cal distributions. The curve labeled FIFO corresponds to 
measurements with one request outstanding at the disk 
drive, which is the scenario described in Figure 2. 


Figure 3: Model-predicted vs. measured (observed) values.


2.5 Practical considerations 


There are a few practical considerations for proximal I/O. 
First, we discovered that the manner in which we issue 
the requests in the strand is important. Issuing requests 
in a random order and relying solely on the scheduler’s 
ability to reorder them does not work. However, when 
we issue the requests in ascending LBN, the scheduler 
works as expected — it picks from the strand a request 
with the lowest positioning cost, services it first, and re- 
orders the remaining ones as necessary. Figure 4 shows 
the effect of issuing requests in ascending LBN for two 
different disk drives. The previously reported results in- 
clude this workaround. 


We attribute this limitation to two factors: the lack of
embedded CPU power (especially for nearline drives)
and a firmware bug. We consulted disk manufacturers,
who acknowledged both factors. In one case, our inquiry
led to a fix in a subsequent firmware release. In practice,
even with current limitations, pre-sorting requests
is not an issue. Second, to engage the request scheduler
properly, the strand must be issued to a non-empty disk 
queue. Again, in practice this limitation is not a prob- 
lem. In many deployed systems, disks are seldom idle; 
they are busy servicing either client-generated workload 
or a variety of background scanning and grooming tasks. 
Third, we explored strands with at most 8 requests out- 
standing, although deeper queues would likely result in 
better results. This is again driven by a practical con- 
sideration. Many commercial storage systems [20, 8, 9] 
limit the number of pending requests to 4 or 8 to put a 
bound on the response time of a time-critical request. 


As a final remark, note that our experiments assumed a
purely random workload, which we simulated by a uniform
distribution of requests in the given range of LBNs.
We believe that workloads that have more locality (but 
which are not sequential by nature) will benefit at least 
as much (if not more) from using proximal I/O. 


Figure 4: The effects of sorting requests in ascending LBN order before issuing them to the disk. The full track data serves as a
reference of request scheduling efficiency and corresponds to reading approximately one full track. The same trends hold for WRITEs.


3 Data layout with proximal I/O 


The previous section explained how proximal I/O can retire
several I/Os per revolution within the span of approximately
100 tracks or half a million LBNs. In this section
we describe the design principles for data layouts that 
leverage the proximal I/O construct. For our discussion, 
we will use the term block to designate the basic data 
allocation unit in file systems (typically 4 KB) to distin- 
guish them from sectors, or LBNs, which are typically 
512 bytes and serve as the basic unit of disk I/O. 


3.1 Increasing I/O density 


We start by considering how to increase the density of 
writes in the SRARW workload. The goal is to take a 
stream of random requests and produce sequences of I/O 
that will benefit from proximal I/O. As long as a storage 
system can produce a batch of several, say eight, write re- 
quests, and the data layout engine can place them within 
the span of ~ 100,000 blocks, each request will be ser- 
viced in time equivalent to much less than one revolu- 
tion and consume only 3.2 OP of resources (i.e., 0.4 OP 
per request as shown in Figure 1) regardless of the pre- 
vious position of the disk heads. In contrast, servicing 
eight blocks randomly scattered across the entire media 
will require 8 OP. Put differently, we need to find a way 
to increase the effective I/O density instead of spreading 
out a given batch of I/Os across the entire disk. 


We use two complementary approaches to achieving the 
necessary I/O density. First, we leverage indirection 
when assigning data to their physical locations akin to 
inodes in file systems that map file offset to a physical ad- 
dress at the underlying storage. Write-anywhere systems 
with no-overwrite semantics [19, 27] already take advantage
of this approach; random writes at the storage system
interface are mapped to the same segment (allocation
area) at the physical layer. Our technique expands on
this notion by allocating data to free space in the vicinity
of the previously written logically related data. Second,
when the I/O density of 6-8 requests within the zone of 
effectiveness of proximal I/O is not enough, we comple- 
ment the new type of write allocation described above 
with the use of a staging area. With a large-enough stag- 
ing area, one can selectively pick appropriate requests 
and write-allocate them to achieve the required I/O den- 
sity as determined by the disk technology. 


Our approach contrasts with existing ones in several 
ways. Traditional frame arrays that export logical vol- 
umes composed of disk drives organized into a RAID 
group typically do not have much flexibility in mapping 
their blocks to the underlying devices. They stripe data in 
a round-robin fashion across the constituent disks. Such 
systems do not require any additional metadata; they can 
compute the disk number and disk offset with simple 
modulo and divide arithmetic. However, a given write 
operation at the storage interface will land at a specific 
location on the disk, negating the desire for decoupling 
the front-end workload from the back-end. As a result, 
they need much larger write caches to achieve the re- 
quired write density compared to our approach. 
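
The "simple modulo and divide arithmetic" of a rigid round-robin layout can be sketched as follows. This is a generic striping illustration, not the mapping of any particular array: it ignores parity placement, and the function name and parameters are hypothetical.

```python
def locate_block(logical_block: int, n_data_disks: int,
                 stripe_unit_blocks: int) -> tuple[int, int]:
    """Map a logical block to (disk index, block offset on that disk)."""
    stripe_unit = logical_block // stripe_unit_blocks  # which unit overall
    within_unit = logical_block % stripe_unit_blocks   # offset inside unit
    disk = stripe_unit % n_data_disks                  # round-robin disk
    row = stripe_unit // n_data_disks                  # stripe (row) number
    return disk, row * stripe_unit_blocks + within_unit


# Example: 4 data disks, 256-block stripe units (1 MB of 4 KB blocks).
for block in (0, 256, 1024, 1300):
    print(block, locate_block(block, n_data_disks=4, stripe_unit_blocks=256))
```

Because the result depends only on the logical block number, every front-end write has exactly one possible on-disk location, which is why such systems cannot steer a batch of random updates toward a narrow LBN span.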


A back-end system with hundreds of large-capacity nearline
disk drives will require hundreds of GB of staging
area. Using that much NV-RAM (i.e., some form
of battery-backed DRAM) would make the overall system
cost prohibitively high (although there are commercial
systems that offer such configurations [10, 9]). A
more cost-effective solution is to use Flash memory as an
append-only log [23], at approximately 1/10th of the cost
per GB. Another possibility is to use a dedicated disk, as
commercial relational database systems do for their log.
However, with a disk-based staging area, we would
require some additional DRAM to hold the data during the
destage operation, so that random reads are served from
DRAM rather than from disk.


3.2 Overwrites and snapshots 


In contrast to purpose-built storage systems [4, 7, 14,
18] that function as fixed-content repositories or that
handle specialized scientific workloads that write lots 
of intermediate results, most writes in commercial sys- 
tems are (logical) overwrites or updates rather than new 
writes. For example, commercial databases and email 
servers [11] update individual records within database 
pages. New, or append-only, writes occur infrequently 
as these systems typically pre-allocate their table space 
by writing out empty (but non-zero) pages in bulk. Sim- 
ilarly, writes to objects containing virtual machine disk 
images create clones of a baseline golden image with rel- 
atively few unique blocks. 


Proximal I/O can also reduce the overhead of preserving 
multiple versions of the same block, be it snapshots for 
fast recovery after a crash or keeping diverging replicas 
of original files. A storage system will turn an update into 
a copy-on-write operation that will write data to a new lo- 
cation. A write-anywhere file layout, for example, lends 
itself to keeping snapshots with very little overhead, as in 
WAFL [19]. Other systems, such as frame arrays, with 
direct mapping of logical blocks to disk locations must 
issue an extra I/O to preserve old block versions. 


Both types of systems exhibit a similar shortcoming. A 
version (the new one in case of WAFL and the old one 
in case of frame arrays) of the data is put in a location 
that is convenient for the system without considering the 
semantic relationship to the original data. This can ad- 
versely affect the efficiency of subsequent reads. Log- 
ically related data may end up too far away from each 
other, incurring high positioning cost when they are both 
first written and then later on retrieved. Therefore, when 
a data layout engine maintains physical proximity of log- 
ically related data (be it a live version or a snapshot), it 
can leverage proximal I/O for copy-on-write of data that 
are updates in place from the client’s perspective. 


Most storage systems use RAID to protect their data 
against disk failures and grown media defects. The 
RAID read-modify-write (RMW) operation is not prox- 
imal I/O per se. However, we can combine copy-out 
and RMW operations and leverage proximal I/O; we can 
pipeline them such that we write out the just-read old 
version of the data within the effective span of proximal 
I/O in time before the disk spins around to write the new 
version of the block. With enough flexibility in the data
layout, we can accomplish two RMW operations, that is,
four media accesses, in time equivalent to slightly more
than one and a half revolutions plus the initial seek.


3.3 Efficiency of reads


So far we have discussed proximal I/O in the context of 
writes. However, it can also improve the efficiency of 
subsequent reads. The careful placement of related ver- 
sions of the data during writes allows the disk to collect 
physically non-contiguous blocks with minimal position- 
ing overhead for logically sequential access. Proximal 
I/O can access both the current data as well as the snap- 
shots with similar efficiency. 


Systems that do not place logically related data near 
each other are likely to perform differently depending on 
which version of the data they access. For example, a 
sequential table scan on an aged system may be less effi- 
cient than one performed against a snapshot made earlier. 
In contrast, when systems can place related data within 
the span of blocks that can be serviced with proximal 
I/O, they will likely exhibit much smaller variations in 
performance regardless of the version/snapshot they are 
reading. This is because both the old and new versions 
of data blocks, as well as logically related unmodified 
blocks will be in close proximity of each other, allowing 
proximal I/O to read either old or new versions of the 
data with high efficiency. 


3.4 Summary of key design points 


We summarize the key design points of a data layout en- 
gine suited for proximal I/O: 


1. Flexible mapping of object data to the physical on-disk
location is an effective mechanism for increasing
I/O density. Put differently, given a certain level
of “randomness” in the front-end workload, systems
with flexible per-block location pointers will need
a smaller staging area compared to systems that use
a rigid mapping of front-end blocks to on-disk physical
locations.


2. The system needs to employ a large-enough write
staging area to achieve the required I/O density
for the given front-end workload. Naturally, completely
random workloads will require the largest
size. In practice, workloads are rarely purely random;
there are typically hot spots where a relatively
small portion of the data is updated frequently.
These hot spots reduce the amount of staging
area required for effective proximal I/O.


3. A data layout engine with a built-in efficient
copy-on-write mechanism is well suited for proximal I/O;
only some adjustments will be necessary to marry
the constraints of proximal I/O with its already
existing mechanisms.


We conclude by examining the various access patterns
encountered by enterprise storage systems. In addition to 
serial reads after random updates (SRARW) that we tar- 
get with proximal I/O, we must consider random reads, 
sequential writes, and sequential reads not coupled with 
frequent (small) updates. It is our belief that storage sys- 
tems already employ effective techniques that can handle 
these other access patterns. In our view, proximal I/O is 
the missing link that fixes the inefficiencies of disk ac- 
cesses in today’s deployments. 


Increasing the size of read caches, for example by em- 
ploying devices based on flash memory, effectively copes 
with random reads. A modicum of NV-RAM can turn 
sequential writes into efficient disk accesses. In con- 
trast, increasing the efficiency of random updates re- 
quires large buffers. Finally, sequential reads (in the ab- 
sence of small updates) are easy to achieve. For exam- 
ple, workloads typical for fixed-content repositories of- 
ten write out and read entire objects. When individual 
objects would be too small for efficient disk I/O, they are 
grouped into larger allocation and access units [4, 35]. 


3.5 Target workloads 


Our work targets workloads in large-scale storage sys- 
tems and is motivated by the emergence of virtualized 
data center environments. The computing infrastructure 
includes a storage manager that allocates data from the 
underlying storage systems in large chunks or extents. 
For example, both the Oracle ASM [12] and VMware
VMFS [13] allocate data in 1 MB chunks. Storage systems
in these environments in turn provide data management
features such as fine-grain snapshots, writable
clones, etc. [20].


In these environments, SRARW workloads are typical.
The storage manager reads and prefetches data in full
allocation units (chunks). However, updates are typically
at a finer granularity: for Oracle ASM, the update size
is equivalent to the DBMS page size (typically 4-8 KB).
For VM hypervisors, the update size is governed by the
block size of the file system in the VM guest operating
system. The writes from the storage managers to the
underlying storage systems may turn into copy-on-write
(rather than update-in-place) operations in order to
preserve older versions of data for disaster recovery. Our
work focuses on these logical update-in-place operations with
serial reads for prefetching or OLAP data scans.


4 Prototype data layout engine 


The goal of our work is not to build an entire new file
system. Instead, we have built a data layout engine (DLE)
that uses our staging and allocation algorithms to
demonstrate the feasibility of using proximal I/O to greatly
improve random write performance while maintaining 
(near) optimal serial performance for SRARW work- 
loads. We believe these algorithms are readily adaptable 
to both update-in-place and write-anywhere file systems. 


Our DLE is, in effect, a stripped-down object storage sys- 
tem. We store logical extents of data in a flat namespace, 
where each extent is named only by a unique ID. Ex- 
tents can be created, read, written (and overwritten) and 
deleted. For simplicity, we only support reads and writes 
that are properly aligned on block boundaries. Our DLE
includes all of the necessary file system structures to
support this functionality: inode-like structures for each
extent, allocation maps to track free space, and additional
metadata to facilitate layouts friendly to proximal I/O.


Because we are primarily interested in addressing the 
SRARW workload, our DLE is designed to efficiently 
support moderately large extents (1 MB or larger)—large 
enough for the serial read portion of the workload to ben- 
efit substantially from sequential layout. Our DLE works 
correctly for smaller extents, but we have not tested or 
optimized its performance in those cases. We believe 
that for those workloads, file systems would benefit more 
from using allocation algorithms that are different from 
those implemented in our prototype DLE. We describe 
here only the major pieces of our prototype necessary to 
understand the experiments presented in Section 5. 


4.1 Extent interface 


The DLE operates on extents. An extent is a contigu- 
ous logical range of bytes. The DLE decides how to best 
allocate extent data into fixed-size blocks (4 KB in our 
prototype) of the underlying storage subsystem—a logi- 
cal volume created from raw disks in a RAID group. In- 
ternally, each extent is represented by an inode, which is 
the root of a constant-height tree of indirect blocks. The 
leaf nodes of this tree contain the extent data. 
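
Locating a data block in a constant-height tree reduces to repeated division by the fan-out. The sketch below is a hypothetical illustration rather than the prototype's code: the fan-out of 974 is borrowed from the experimental setup in Section 5.1, and the tree height of 2 is an assumption.

```python
def tree_path(block_index: int, fanout: int, height: int) -> list[int]:
    """Child index to follow at each level, from the inode to the leaf."""
    if not 0 <= block_index < fanout ** height:
        raise ValueError("block index outside the addressable range")
    path = []
    for level in range(height - 1, -1, -1):
        span = fanout ** level  # leaves covered by one child at this level
        path.append(block_index // span)
        block_index %= span
    return path


# A 16 MB extent holds 4096 blocks of 4 KB; assumed height of 2.
for block in (0, 973, 974, 4095):
    print(block, tree_path(block, fanout=974, height=2))
```

Because the height is constant, every lookup touches the same number of indirect blocks regardless of where in the extent the data lies.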


4.2 Staging area 


Our DLE uses a separate flash device as a staging area to 
accommodate random writes. When the DLE writes data 
into the staging area, it also updates the corresponding 
metadata including inode and indirect blocks for the just- 
written extent. Thus the staging area is the full-fledged 
home (albeit temporary) for new data, rather than a write 
cache with a copy of the data. When the system achieves 
the required I/O density (or the staging area runs out of 
capacity) we use proximal I/O to move data from the 
flash-based staging area to the on-disk location. More 
importantly, during destage, we make just-in-time allocation
decisions for the best on-disk placement in relation
to the data already allocated to the disk. Because
of our desire to keep previous versions of data for snap- 
shots, we don’t overwrite in place and instead write to a 
new location. When the destage operation finishes, we 
update the extent and DLE metadata accordingly to re- 
flect its new on-disk location. 


As part of its metadata, the staging area maintains a ta- 
ble mapping each block in the staging area to the extent 
and offset that the block belongs to. This allows us to 
efficiently locate items in the staging area for destage. 
Our DLE also uses the flash device to store its internal 
metadata, so metadata access does not interfere with the 
SRARW access patterns we wish to study. 


4.3 Allocation policies 


The more interesting feature of our prototype is the set 
of write allocation policies we implemented. When new 
data is written to an extent, we use the size of the write 
to determine whether to write the data to the staging area 
or directly to disk. In the current implementation this 
threshold is 168 KB—a number chosen to be approxi- 
mately the break-even point between the response time 
of a random I/O of that size and a full-track read for our 
current disks. Although we have not examined alternate
settings, we believe that the precise value has little
qualitative effect on our system; it only serves to distinguish
between small and large writes.
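
The size-based routing decision can be sketched in a few lines. The 168 KB threshold comes from the text; the names and the choice to send a write of exactly the threshold size to disk are assumptions made for illustration.

```python
from enum import Enum

SMALL_WRITE_THRESHOLD = 168 * 1024  # bytes; threshold quoted in the text


class Target(Enum):
    STAGING_AREA = "staging area"
    DISK = "disk"


def route_write(size_bytes: int,
                threshold: int = SMALL_WRITE_THRESHOLD) -> Target:
    """Send small writes to the staging area, large writes to disk.

    Assumption: a write of exactly `threshold` bytes counts as large.
    """
    return Target.STAGING_AREA if size_bytes < threshold else Target.DISK


print(route_write(4 * 1024))      # a 4 KB update
print(route_write(1024 * 1024))   # a 1 MB write
```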


We have three different I/O allocation scenarios in our 
system: small writes allocated to the staging area, large 
writes allocated on disk, and collections of small writes 
allocated on disk when destaging. We manage the stag- 
ing area as an append-only log. Other more involved 
schemes are possible, but we have not explored them. 
When the staging area fills, we destage its full contents 
and start refilling it again from the beginning. As de- 
scribed earlier, when we write a block to the staging area, 
we update the metadata that points to it, freeing any on- 
disk block containing older data at that offset if that block 
is not used for a snapshot. 


When we receive a large write request, we write it di- 
rectly to disk, allocating new space if necessary, as when 
first writing an extent. Since we assume that extents 
will be large, and we want to provide good serial per- 
formance, we map large sequential ranges of an extent 
to similarly sized physical extents on disk that we call 
allocation ranges. By allocating at first fewer physical 
blocks than the size of the allocation range, we can pro- 
vide extra space for future updates and write-anywhere- 
style snapshots [19] at the cost of a corresponding frac- 
tion of serial bandwidth. We have not yet explored this 
capability in our prototype. 


We follow the recommendations of Chen et al. [6] for
stripe unit size to approximate the disk track size. Given
the current disk parameters, we chose 1 MB as the size
for the RAID stripe unit and allocation range for nearline
drives and 512 KB for enterprise-class drives. Note
however, that we need not know the precise disk parame- 
ters. The allocation unit size is a configurable parameter 
in our allocation algorithm and can loosely follow tech- 
nology trends over time as track size increases. 


Small writes (or updates) are first written to a staging
area and held there until a sufficient number of random
updates has accumulated to achieve the required proximal
I/O density. At that point we collect the relevant data
(using our metadata info) and destage them to their final
place. That is where alternative storage technologies
work to our advantage; we can perform random reads 
directly from the staging area backed by e.g., flash mem- 
ory. If using disks instead, we need to perform (possibly 
multiple) sorting pass(es) and use additional DRAM. 


Destage is a two-phase process. First, by scanning the
staging area tables, we identify sets of blocks that can
be allocated together. We do this by sorting the blocks
first by extent and then by logical offset within each
extent. Second, from the extent metadata, we determine the
allocation range(s) that contain related data, i.e., data at
the logical offsets immediately preceding and following
the data being destaged. If there is enough space in the
corresponding allocation range, we simply write-allocate
data there. When no additional space exists, we look for
another allocation range that has enough free space to
absorb the blocks and is in the vicinity of proximal I/O.
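
The two phases can be sketched as follows. This is a simplified illustration under stated assumptions, not the prototype's algorithm: it treats the logical range index as the index of the "home" physical allocation range (the real system consults extent metadata), and the search window, data structures, and names are hypothetical.

```python
from collections import defaultdict

PROXIMAL_WINDOW = 100  # assumed search distance, in allocation ranges


def group_staged_blocks(staged, blocks_per_range):
    """Phase 1: sort by (extent, offset), group by home allocation range.

    `staged` is an iterable of (extent_id, logical_block_offset) pairs.
    Returns {(extent_id, range_index): [offsets, ...]}.
    """
    groups = defaultdict(list)
    for extent_id, offset in sorted(staged):
        groups[(extent_id, offset // blocks_per_range)].append(offset)
    return dict(groups)


def pick_allocation_range(home, free_blocks, needed,
                          window=PROXIMAL_WINDOW):
    """Phase 2: prefer the home range, else the nearest range with room.

    `free_blocks[i]` is the number of free blocks in physical range i.
    Returns a range index, or None if nothing within the window fits.
    """
    for distance in range(window + 1):
        for candidate in (home - distance, home + distance):
            in_bounds = 0 <= candidate < len(free_blocks)
            if in_bounds and free_blocks[candidate] >= needed:
                return candidate
    return None


staged = [(2, 300), (1, 5), (1, 260), (1, 7)]
print(group_staged_blocks(staged, blocks_per_range=256))
print(pick_allocation_range(3, [10, 0, 0, 1, 0, 4], needed=2))
```

Searching outward from the home range keeps the destaged blocks as close as possible to their logically related data, which is what allows later reads to use proximal I/O.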


In the worst case we inspect up to approximately 100
allocation ranges (given current disk characteristics) for
each group of blocks, i.e., all blocks in the staging area
belonging to a single extent and whose logical offsets lie
within the range of proximal I/O. In practice, this number is
much smaller; when we write blocks to the staging area,
we typically deallocate the older version of the block on
disk, unless it is kept for snapshots. If we are destaging
to an allocation range that has no underlying physical
storage (i.e., we are writing to a sparse extent), we first
allocate a physical extent for the allocation range, and
then allocate the destage blocks within it. Figure 5
illustrates the destaging process, showing the layout of data
in both flash memory and disk.


Our allocation algorithm uses two parameters dependent
on disk technology trends: (a) the SPT governs the
efficient allocation and serial I/O size and (b) the minimal
seek time governs the effective range of proximal I/O. The
first parameter dictates the size of our allocation range;
the second one, expressed in the number of allocation
ranges, provides the flexibility in our allocation
decisions during the destage operation. The parameters need
not closely follow the technology trends. One adjustment for
every few disk generations is sufficient. The trends are
evolving to our advantage (see discussion in Section 5.4).


4.4 RAID Layer 


Our user-level prototype also includes a stand-alone
RAID subsystem, which presents a logical volume
abstraction to our DLE. This has several benefits compared
to using an off-the-shelf one such as a hardware RAID
controller or a software implementation such as the md
block device driver in Linux (our prototype platform).


Our RAID implementation offers fine control over
scheduling requests to individual disks. We use the Linux
SCSI generic device (/dev/sg) interface, which bypasses
the kernel block device’s buffer cache and the block
device schedulers. Linux can issue SCSI commands
directly to both SCSI/FC and SATA drives thanks to the
libata layer. Most importantly, our own implementation
more closely emulates the operation of an enterprise-class
RAID layer and includes features that are missing
from the aforementioned RAID implementations.


First, we perform updates either by addition or subtrac- 
tion so as to minimize the number of disks engaged in 
I/O operations. Second, just like many other RAID
subsystems [20, 8, 9], we maintain additional information
for every data block, including a write generation
number for lost write protection and an additional data
checksum. Since SATA disks support only 512-byte sectors,
we must use a separate sector for the additional per-block 
information. We use 64 bytes of additional information 
per single 4 KB block grouped into one checksum block 
for every 63 data blocks, emulating the features of Data 
ONTAP [20] running on systems with commodity SATA 
disks. Thus, accessing one block above the DLE inter- 
face results in two distinct block accesses. 
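
Parity updates "by addition or subtraction" can be illustrated with XOR parity. The sketch below assumes these terms refer to the two standard ways of recomputing parity (from the untouched data blocks, or from the old data and old parity); it is not the prototype's code, and all names are hypothetical.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*blocks))


def parity_by_subtraction(old_parity, old_data, new_data):
    """Remove the old data from the parity, then add the new data."""
    return xor_blocks(old_parity, old_data, new_data)


def parity_by_addition(new_data, untouched_data):
    """Recompute parity from the new block and the untouched blocks."""
    return xor_blocks(new_data, *untouched_data)


def reads_needed(n_data_disks: int, n_updated: int) -> dict:
    """Disk reads each method needs before writing data and parity."""
    return {
        "subtraction": n_updated + 1,          # old data + old parity
        "addition": n_data_disks - n_updated,  # blocks left unchanged
    }


stripe = [bytes([value] * 8) for value in (1, 2, 3, 4)]
old_parity = xor_blocks(*stripe)
new_block = bytes([9] * 8)
by_sub = parity_by_subtraction(old_parity, stripe[1], new_block)
by_add = parity_by_addition(new_block, [stripe[0], stripe[2], stripe[3]])
print(by_sub == by_add, reads_needed(4, 1), reads_needed(4, 3))
```

Both methods yield the same parity; in a 4+1 group, subtraction needs fewer reads when one block of a stripe changes, while addition needs fewer when most of the stripe changes.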


5 Results 


We evaluate the effectiveness of proximal I/O using our 
DLE prototype. We first study random updates to large 
extents comprising a logical volume (LUN) exported by 
a storage system, and then analyze serial reads after our
volume has been aged with many small random updates. 


5.1 Experimental setup 


Our prototype runs as a user-level process on a host with 
one dual core 3 GHz Intel CPU under Linux 2.6.24 (from 
stock Ubuntu Server 9.04 distribution). We use a 4+1 
RAID4 of 1 TB Western Digital RE3 (WD1002FBYS) SATA
drives. We chose these 7200 RPM drives despite their lower
performance compared to their enterprise-class counterparts
because they are more cost effective.


We fill our DLE with 16 MB extents to 89% of its capac- 
ity, writing them serially directly to the disks. We then 
issue 2000 small (4 KB) random updates per extent, thus 
re-allocating half of all blocks we initially wrote. Our 
DLE accumulates these updates to an SSD-based staging 
area, destaging them to back-end disk each time the stag- 
ing area fills. For measuring serial reads after writes, we 
read every single extent from our aged DLE (in random 
order). These requests for 974 logically serial blocks at 
a time (governed by the fan-out of our indirect blocks) 
result in several scatter-gather disk I/Os. Figure 5 shows 
the layout during the execution of updates. 


Before the DLE issues a set of requests to the RAID 
layer, we execute random I/Os to each of the constituent 
disks so as to avoid “short-stroking” (i.e., generating arti- 
ficially short seeks due to using only a subset of the disk 
capacity). We wait for the disks to complete the random
I/Os and exclude these unrelated I/Os from our analysis. 
Executing a set of small updates results in many more in- 
dividual disk I/Os than there are requests in the batch — 
the RAID layer needs to access the additional checksum 
blocks and to perform read-modify-write operations. 


5.2 Random updates 


The results for the random updates are summarized in 
Table 2. Each table row represents measurements with 
a different size of the staging area relative to the RAID 
group size. We collect statistics for each batch of I/O, 
where one batch is the disk I/O generated in destaging 
the accumulated changes to a single extent. Thus, an entire
destage operation will generate one batch of I/O for each 
extent with at least one block in the staging area. We 
measure the mean response time and the number of user 
updates for each batch (columns 2 and 3). These times 
reflect the disk activity (i.e., the operations on the data). 
The DLE and extent metadata are updated on the SSDs, 
on average, with fewer than three I/Os for each batch. 
Since the SSD I/O service time is much smaller, the disk 
I/O dominates the batch response time. 


We collect the service time for each disk I/O (computed 
as the difference between the completions of the last two 
I/Os). We list the mean number of disk I/Os (reads and 
writes across all five drives) in column 4. Column 5 
shows the I/O amplification, the mean number of disk 
I/Os needed for each user-initiated update. Column 6 
shows the equivalent number of disk I/Os serviced per 
revolution. Finally, we show for the data and parity disks 
the mean number of I/Os, per-I/O service time, and the 
resulting disk utilization. 
(a) Initial setup with sequentially written extents.


(b) Aged layout with updates still in the Flash memory staging area.


Figure 5: An example of block allocation in the prototype DLE. Initially, extents are written out directly to the RAID group. The
Flash memory (SSD) holds extent and DLE metadata. Random updates are first put into the Flash staging area. As the layout ages,
the extents are no longer contiguously laid out. However, the DLE maintains the proximity of the related blocks of the same extent
by “plugging” holes in the layout created by overwrites between two destage operations.

















Stage      Resp.     User     Disk     I/O     I/Os      Data disks             Parity disk
area       time      writes   I/Os     ampl.   p. rev.   I/Os    ST    Util.    I/Os    ST    Util.
Baseline   16.8 ms   1        8        8x      2.0       4       4.2   97%      4       4.2   97%
1%         129.5     47.5     295.3    6.2x    5.3       43.3    2.0   65%      122.0   1.0   91%
2%         155.0     85.2     465.3    5.5x    7.2       66.1    1.4   61%      201.1   0.7   93%
3%         182.9     119.7    614.8    5.1x    8.1       86.2    1.2   57%      269.9   0.6   94%
4%         227.6     149.3    723.7    4.8x    7.7       99.8    1.1   47%      324.3   0.6   86%
5%         235.9     179.1    847.6    4.7x    8.7       117.0   1.0   48%      379.8   0.6   95%
6%         278.4     226.1    1014.7   4.5x    9.0       136.3   0.9   44%      469.3   0.6   97%
7%         315.4     259.5    1151.5   4.4x    9.0       155.6   0.9   42%      529.2   0.6   97%
8%         320.9     254.8    1166.7   4.6x    8.7       163.5   0.8   43%      512.6   0.6   96%























Table 2: Random updates for various sizes of the staging area. Resp. time is the response time of the batch of user I/Os being
destaged and ST is the per-disk I/O service time, both reported in milliseconds. I/O ampl. is the ratio of disk I/Os to user writes.
The I/Os per revolution represents the number of I/Os serviced by a drive, averaged across both data and parity disks. The baseline data
represents a system without a staging area, whereby every user write results in RMW at the RAID back-end.
In our baseline data, we show the performance when a
batch contains exactly one block. This has a latency of 
16.8 ms, and results in 8 separate disk I/Os (an I/O ampli- 
fication of 8x). Writing a single block to RAID4 results 
in 4 individual disk I/Os—a read and write of both the 
data disk and the parity disk. Updating the checksum 
incurs an additional 4 I/Os. 


Next, we explore how our write allocation, coupled with
a staging area of 1%, leverages proximal I/O to improve
the efficiency of disk accesses. We observe that, on
average, 47.5 user updates result in 295.3 disk I/Os (an
amplification of 6.2x) that are serviced in 129.5 ms. The
per-I/O service time for the data and parity disks is thus
2.0 ms and 1.0 ms, respectively. Even though the RAID4
parity disk is more efficient, it has to service many more
I/Os and thus is the bottleneck.


Since the batch of I/Os is serviced by proximal I/O, we
can retire on average 5.3 I/Os per revolution. Yet, as
shown in Figure 1, we were only able to retire 1.9 I/Os
per revolution in a strand of 8 requests (17.1 ms to retire
8 read requests with 4 ms rotational time). We achieve
this improvement because of the greater I/O density; we
are writing strands that contain many more blocks,
typically within a single track or two. In addition, the
RAID layer must update the checksum block for data blocks
that are being written out. This further increases the
number of disk I/Os, but also the I/O density: for most
data blocks the checksum block is on the same track. For
the same reason, we also see only a 6.2x write amplification
(instead of 8x); we need to access the same checksum
block only once for several data block updates.
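
The derived columns of Table 2 can be recomputed from its raw columns. The sketch below is one reading of the table caption, not the authors' script: it assumes one revolution of a 7200 RPM drive (about 8.33 ms) and takes "I/Os per revolution" to be the mean of the data-disk and parity-disk I/O counts divided by the number of revolutions that fit in the batch response time. Under that assumption the recomputed values agree with the printed columns to within 0.1.

```python
REV_MS = 60_000 / 7200   # one revolution at 7200 RPM, in milliseconds

# (stage area, resp. time ms, user writes, disk I/Os,
#  data-disk I/Os, parity-disk I/Os, printed ampl., printed I/Os per rev.)
TABLE2 = [
    ("1%", 129.5, 47.5, 295.3, 43.3, 122.0, 6.2, 5.3),
    ("2%", 155.0, 85.2, 465.3, 66.1, 201.1, 5.5, 7.2),
    ("3%", 182.9, 119.7, 614.8, 86.2, 269.9, 5.1, 8.1),
    ("4%", 227.6, 149.3, 723.7, 99.8, 324.3, 4.8, 7.7),
    ("5%", 235.9, 179.1, 847.6, 117.0, 379.8, 4.7, 8.7),
    ("6%", 278.4, 226.1, 1014.7, 136.3, 469.3, 4.5, 9.0),
    ("7%", 315.4, 259.5, 1151.5, 155.6, 529.2, 4.4, 9.0),
    ("8%", 320.9, 254.8, 1166.7, 163.5, 512.6, 4.6, 8.7),
]


def amplification(disk_ios, user_writes):
    """Disk I/Os issued per user-initiated update."""
    return disk_ios / user_writes


def ios_per_revolution(resp_ms, data_ios, parity_ios):
    """Mean per-drive I/O count divided by revolutions elapsed (assumed reading)."""
    revolutions = resp_ms / REV_MS
    return (data_ios + parity_ios) / 2 / revolutions


if __name__ == "__main__":
    for area, resp, writes, ios, data, parity, _, _ in TABLE2:
        print(area,
              round(amplification(ios, writes), 2),
              round(ios_per_revolution(resp, data, parity), 2))
```

The raw columns are also self-consistent: four data disks plus one parity disk account for the total disk I/O count in each row (for example, 4 x 43.3 + 122.0 is about 295.3).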


As the size of the staging area increases, the batch size
increases (from 47.5 to 369.7 updates for staging areas
of 1% and 10%, respectively) and the destage operation
for each batch becomes more efficient. The I/O
amplification decreases from 6.2x to 4.5x and the number of
disk I/Os serviced per revolution grows from 5.3 to 8.6.
5.3 Serial reads after random updates


Table 3 shows the results for sequential reads from an
aged layout as depicted in Figure 5(b). Our system reads
up to 974 logically consecutive blocks. Given the extent
sizes, our DLE requests on average 819.2 blocks
that are returned with a mean response time of 9.5 ms.
This translates to an effective bandwidth of 86.2 MB/s per
disk (345 MB/s / 4 data disks in the RAID group). Because
of the fragmented layout, the request for up to 974 logical
blocks results in a batch of 47.5 logically sequential runs
of blocks issued to the RAID group that are further broken
into individual per-disk I/Os. Because of striping
and the need to access the checksum block in addition to
the data blocks, each disk services on average 20.7
individual I/O requests. Given the 1 MB stripe unit size
on our RAID setup, the original request for 974 logically
sequential blocks is typically serviced by three disks.


We repeated the same experiment on the non-aged data 
layout depicted in Figure 5(a) where extents are laid out 
on physically contiguous disk sectors. We measured a
mean response time of 9.2 ms, which translates to a
per-disk bandwidth of 89 MB/s. Thus, sequential reads after
random updates on our system are within 3% of the
best-case scenario of a physically contiguous layout.


Finally, we evaluated the performance of serial reads af- 
ter random updates with write-anywhere-style allocation. 
In our system, we induced this behavior by eliminating 
the staging area and writing out data by greedily plug- 
ging the holes created by deletes of earlier versions of 
the data (we did not implement an LFS-style segment 
cleaner). In this setup, logically serial data becomes increasingly
dispersed over the disk over time, resulting in dramatically
lower bandwidth compared to the baseline case.


5.4 System cost and technology trends 


Lowering overall cost is one of the driving forces behind 
changing the internal architecture and design of commer- 
cial enterprise-scale storage systems. The adjustments to 
the write allocation policies presented here coupled with 
deployment of some additional device(s) for the stag- 
ing area is but one example of such force. Making the 
prevalent access pattern (e.g., the serial read after random 
write described here) more efficient allows the system to 
run workloads with larger I/O demand for the same dollar
cost. We now explore the trade-off between the cost of
additional hardware for the staging area and the resulting 
improvement in the back-end disk I/O capacity. 


                 Per disk statistics
                 Read BW     Diff    I/Os    Util.
Baseline         89.0 MB/s           11.7    85%
Aged layout      86.2 MB/s   -3%     20.7    82%
Write-anywhere   2.6 MB/s    -97%    210.2   85%

Aged layout reads: detailed statistics
                              mean     min    max
Request response time (ms)    9.5      6.5    32.9
Request size (4 KB blocks)    819.2    200    974
Requests per batch            43.9     28     114
Span of blocks                1002.8   914    1008
Number of I/Os per disk       20.7     2      58
Per-disk resp. time (ms)      8.8      0.9    32.8

Aged layout: read response time quantiles (ms)
10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
7.4   7.6   7.7   Wo    8.2   8.4   2251 278.3129


Table 3: Serial reads after random updates.


Consider the WD1002FBYS disk drive we used in our
experiments. It has a measured average seek time of
7.5 ms and rotational speed of 7,200 RPM. With the time
of 8.4 ms for a single rotation, the mean time to service
a random I/O is 11.7 ms. Thus our drive can perform 86
random IOPS. With a street price of $130, this means we
are paying $1.52 per IOPS. Now consider the effects of
adding 1% of capacity as a flash staging area. Table 2
shows that in this configuration we can write 5.3 blocks
in a single revolution. Adding an average seek means
that our system performs 5.3 writes in 15.9 ms, or 3 ms
per write. This is equivalent to 333 random IOPS, an
improvement of 289% over the basic disk solution.


Adding the flash staging area increases the cost of the
system. With the cost of flash at $3.13/GB, based on a
160 GB Intel X25 SSD with a street price of $500, a 1%
staging area for our 1 TB drive requires 10 GB of flash,
increasing our cost by $31.30, to a total of $161.30 for a
configuration capable of 333 random IOPS. Thus, in our
system, the cost is $0.48 per IOPS, less than a third of
the per-IOPS cost of the raw disk drive.
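
The cost comparison above is simple enough to recompute. The sketch below uses only figures quoted in the text (seek time, rotation time, prices, and the 5.3 writes per revolution from Table 2); the function names are illustrative. Small differences in the last digit against the text come from rounding intermediate values (for example, 86 IOPS versus 85.5).

```python
SEEK_MS = 7.5                        # measured average seek time
ROTATION_MS = 8.4                    # time for one rotation, as quoted
DISK_PRICE = 130.0                   # street price of the 1 TB drive, USD
FLASH_PRICE_PER_GB = 500.0 / 160.0   # 160 GB SSD at $500
STAGING_GB = 10.0                    # 1% of a 1 TB drive
WRITES_PER_REV = 5.3                 # from Table 2, 1% staging area


def disk_only():
    """Return (random IOPS, dollars per IOPS) for the bare disk."""
    service_ms = SEEK_MS + ROTATION_MS / 2   # seek plus half a rotation
    iops = 1000.0 / service_ms
    return iops, DISK_PRICE / iops


def with_staging():
    """Return (random write IOPS, dollars per IOPS) with a 1% flash staging area."""
    ms_per_write = (SEEK_MS + ROTATION_MS) / WRITES_PER_REV
    iops = 1000.0 / ms_per_write
    total_cost = DISK_PRICE + FLASH_PRICE_PER_GB * STAGING_GB
    return iops, total_cost / iops


if __name__ == "__main__":
    print("disk only:    %.1f IOPS, $%.2f per IOPS" % disk_only())
    print("with staging: %.1f IOPS, $%.2f per IOPS" % with_staging())
```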


With our system, we pay an extra 25% to add a flash
staging area and in return we get nearly a 3x performance
increase on random writes, while preserving near
sequential on-disk layout.* Note that these numbers are
pessimistic. They assume that the staging area is scaled
to the entire disk-based storage capacity. In reality, the
staging area need only be 1% of the write working set,
further reducing the flash costs.


We conclude by considering the impact of technology
trends on the effectiveness of proximal I/O. The disk
trends are in our favor. Growing areal media densities
(i.e., the increase in both SPT and track density) increase
the span of LBNs over which proximal I/O will be effective.
A larger span gives our system more options
to lay out its data. Similarly, as the flash memory cost
decreases, the size of the deployed staging area
will likely increase relative to the disk storage capacity.
This will also increase the effectiveness of our destage
process. In the end, however, the ratio of flash memory to
disk capacity will be driven by customer needs and their
ability to get the right performance for the least cost.


* These numbers are shown here only to illustrate our point. Our
simplified model considers only the cost of individual devices. Also,
we ignore many practical system issues such as RAID group size, etc.


6 Related Work 


Our work explores the design principles for a data lay- 
out suitable for the SRARW workload while leveraging 
a more efficient disk access pattern. We review some ad- 
ditional related work not mentioned earlier in the paper. 


Our data layout is similar in principle to the data
journalling mode employed by some journalling file
systems [5, 26, 32]. As in those systems, we write data
initially to a designated staging area (journal, separate
device, etc.) and later destage it to its final location.
The difference in our approach, which utilizes proximal
I/O, is that the efficiency of our destage operation
is much higher; journalling file systems typically
write to a specific location on the disk constrained by
their “overwrite-in-place” policy. In contrast, we destage
data with fewer constraints offered by the span of blocks
in proximal I/O. Additionally, we can consider the best
location with respect to the related data and thus make
the write operations more efficient.


The Disk Caching Disk (DCD) [24] explored a differ- 
ent technique for using write caching to improve stor- 
age system performance. DCD aggregates small writes 
on a separate caching disk, achieving serial performance 
when flushing dirty data from the buffer cache. During 
idle periods it destages data from the cache disk to its 
home on the primary disk. This design improves the la- 
tency of small writes, but does not leverage proximal I/O 
to achieve better I/O efficiency. A similar technique has 
also been used in database systems [16]. 


The idea of proximal I/O combines and expands on the 
observations about (1) efficient disk access across ad- 
jacent tracks with minimal positioning cost [30] and 
(2) minimal positioning cost when seeking across an 
ever-increasing range of cylinders [31]. Unlike semi- 
sequential access, however, proximal I/O does not re- 
quire detailed knowledge of disk geometry or specialized 
device interface that provides the position of the next 
semi-sequential block relative to the current position; it 
works on systems with standardized interfaces (SATA or
SCSI) and off-the-shelf commodity disk drives.
Our DLE design relies on write-anywhere allocation,
similar to LFS [27], WAFL [19] and related designs 
such as ZFS [22] and btrfs [3]. Like these systems, it 
never overwrites old data in place, making it possible 
to preserve older versions of data, or snapshots, with 
minimal I/O overhead. Traditional write-anywhere file 
systems batch temporally related dirty data for efficient 
disk writes. Thus logical data locality is lost, requir- 
ing segment cleaning [27] or other defragmentation tech- 
niques [15] to re-establish sequential layout. In contrast, 
our DLE allocates data close to logically related on-disk 
data, preserving logical locality with proximal I/O plus
the staging area to achieve efficient write performance. 


The Loge [17] disk controller represents another varia- 
tion of write-anywhere; it virtualizes block addresses so 
that it can write incoming data at the free locations near- 
est to the current disk head location. However, the work 
does not target SRARW workloads; it explicitly assumed 
that randomly written data would also be randomly read. 
In principle, many aspects of our DLE design could be 
implemented in a Loge-like disk controller rather than in 
a file system, although it would lose the semantic information
about which blocks of data are logically related
and are likely to be read together. Moreover, our design 
does not require detailed knowledge of disk head posi- 
tion and thus is time-invariant. 


Appendix A: Proximal I/O model 


Our objective is to find the expected number of revolu- 
tions needed to serve D requests in a strand. Recall that 
a strand is a collection of proximal I/Os that are sent 
together to a disk and that are close enough such that 
servicing any one of the requests incurs a minimal seek 
equivalent to head/track switch time. 


Assume there are SPT sectors per track and D requests, 
each of size S sectors, and a seek between each request in 
the strand equivalent to head switch time, H. We express 
H in terms of the number of sectors that pass by the disk 
head during track switch. We can formulate the problem 
of finding the expected number of revolutions in terms 
of binning requests into B bins. Each request of size S is 
then randomly placed into any one of the K slots along a 
circular track. This is analogous to a roulette wheel with 
K slots and D balls spun simultaneously. 


With such a formulation, if two balls (i.e., requests on 
different tracks) fall into the same bin, that is, if they are 
within K x S/H slots, the disk arm cannot service those 
requests in a single revolution and we get 


$$B = \frac{SPT}{H} = \frac{K \times S}{H}$$
Let’s express the probability, Q_1, that no two requests are
in the same bin. This is analogous to the probability that
no extra revolutions are required when servicing a schedule
of D requests in a strand. This can be solved by the
birthday paradox problem, where we look for the probability
that no two people out of a group of n people in a
room have a birthday on the same day out of b possible,
and equally likely, birthdays:


$$Q_1 = \frac{b!}{(b-n)!\, b^n}$$


Using our analogy, we have B bins, which is equivalent
to the b possible birthdays, and D, the number of requests
in a strand, is the number of people, n. This is equivalent
to not using any extra revolutions (since each request is
in a separate bin) when servicing the D requests.


We can now calculate the probability that at least one 
extra revolution will be required as 


$$P_1 = 1 - Q_1.$$


More generally, the probability that a birthday is shared 
by exactly k (and no more) people is expressed as [33] 


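$$Q_k(n,b) = \sum_{i=1}^{\lfloor n/k \rfloor} \left[ \frac{n!\, b!}{b^{ik}\, i!\, (k!)^i\, (n-ik)!\, (b-i)!} \sum_{j=1}^{k-1} Q_j(n-ik,\, b-i)\, \frac{(b-i)^{n-ik}}{b^{n-ik}} \right]$$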

This is equivalent to the probability of servicing a given 
strand in k revolutions or using exactly k — 1 extra revo- 
lutions. This assumes that each request landed on a sep- 
arate track and requires a track switch when servicing it. 


For the probability that we will require at least k extra
revolutions in servicing a request (or that k + 1 or more people
share a birthday in our analogy), we have


$$P_k = 1 - \sum_{i=1}^{k} Q_i$$


Now let’s express the probability that we will not use 
any extra revolutions when servicing a strand as a func- 
tion of number of sectors, H, that pass by during track 
switch time. With values for the Seagate Cheetah 15K.5 
disk’s first zone we have SPT = 1200, track switch time 
0.475 ms, H = SPT × [0.475 ms / 4 ms] = 142, and the
number of bins B = 8.45. Therefore, we set ⌊B⌋ = 8,
meaning that this disk can at best schedule 8 proximal
I/Os in a revolution when the requests are properly offset
from each other. With a strand where D = 8, the
probability of not using any extra revolutions is close to zero.


We express the expected number of revolutions for ser- 
vicing a strand of D requests as 


$$E[\mathrm{Revs}] = \sum_{i=1}^{D} i \, Q_i(D, SPT/H)$$


For D = 8, we get E[Revs] = 3.4, assuming that each 
request lands on a separate track. Normalized (or per- 
request) number of revolutions is then 0.43. 
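
The model can be explored numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it evaluates a standard recurrence for the generalized birthday problem with exact rational arithmetic, treats the number of bins as the integer ⌊B⌋, takes the expectation as the sum of i·Q_i for i from 1 to D, and uses hypothetical function names (`q`, `p_at_least_extra`, `expected_revolutions`). It is not claimed to reproduce the figures quoted above, which depend on parameter choices not fully stated here.

```python
from fractions import Fraction
from functools import lru_cache
from math import factorial


@lru_cache(maxsize=None)
def q(k, n, b):
    """Probability that the fullest bin holds exactly k of n requests
    thrown uniformly at random into b bins."""
    if n == 0:
        # Empty strand: counted as the collision-free case by convention.
        return Fraction(1 if k == 1 else 0)
    if b <= 0:
        return Fraction(0)
    if k == 1:
        if n > b:
            return Fraction(0)
        return Fraction(factorial(b), factorial(b - n) * b ** n)
    total = Fraction(0)
    for i in range(1, n // k + 1):       # i bins hold exactly k requests each
        if i > b:
            break
        rest_n, rest_b = n - i * k, b - i
        ways = Fraction(
            factorial(n) * factorial(b),
            b ** (i * k) * factorial(i) * factorial(k) ** i
            * factorial(rest_n) * factorial(rest_b),
        )
        # Remaining requests must all land in less crowded bins.
        smaller = sum((q(j, rest_n, rest_b) for j in range(1, k)), Fraction(0))
        total += ways * smaller * Fraction(rest_b ** rest_n, b ** rest_n)
    return total


def p_at_least_extra(k, n, b):
    """Probability of needing at least k extra revolutions."""
    return 1 - sum(q(i, n, b) for i in range(1, k + 1))


def expected_revolutions(n, b):
    """Expected revolutions for a strand of n requests (assumed sum over 1..n)."""
    return sum(i * q(i, n, b) for i in range(1, n + 1))


if __name__ == "__main__":
    D, B = 8, 8   # strand size and floor(B) from the Cheetah 15K.5 example
    print("P(no extra revolution) =", float(q(1, D, B)))
    print("E[revolutions]         =", float(expected_revolutions(D, B)))
```

With D = 8 and 8 bins, the collision-free probability is 8!/8^8, roughly 0.002, which matches the statement that it is close to zero.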


Next, we assume eight requests in a strand even though 
this disk can service at best six in a single revolution. We 
choose the value of eight because it gives, on average, 
12% lower per-request service time compared to a strand 
with D = 6. Adding an initial average seek of 3.5 ms 
for each strand, the per-request service time is 2.16 ms 
or 17.26 ms for the entire strand of D = 8 with variance
σ² = 9 ms. This comes to within 1% of the measured
mean service time of 17.14 ms with σ = 9.6 ms.


Finally, we examine the probability of using exactly one, 
two, three, and so on, revolutions when servicing a strand 
of D = 8 requests. From our model, the most prevalent 
value is two extra revolutions (three in total). When D = 
6 (with H = 7 for our disk), the probability of not using 
any additional revolutions is still only 0.02. 
Abstract


Previous approaches to RAID scaling either require a 
very large amount of data to be migrated, or cannot toler- 
ate multiple disk additions without resulting in disk im- 
balance. In this paper, we propose a new approach to 
RAID-0 scaling called FastScale. First, FastScale mini- 
mizes data migration, while maintaining a uniform data 
distribution. With a new and elastic addressing function, 
it moves only enough data blocks from old disks to fill 
an appropriate fraction of new disks without migrating 
data among old disks. Second, FastScale optimizes data 
migration with two techniques: (1) it accesses multiple 
physically successive blocks via a single I/O, and (2) it 
records data migration lazily to minimize the number of 
metadata writes without compromising data consistency. 
Using several real system disk traces, our experiments 
show that compared with SLAS, one of the most effi- 
cient traditional approaches, FastScale can reduce redis- 
tribution time by up to 86.06% with smaller maximum 
response time of user I/Os. The experiments also illus- 
trate that the performance of the RAID-0 scaled using 
FastScale is almost identical with that of the round-robin 
RAID-0. 


1 Introduction 


Redundant Array of Inexpensive Disks (RAID) [1] was 
proposed to achieve high performance, large capacity 
and data reliability, while allowing a RAID volume to 
be managed as a single device. As user data grows
and computing power increases, applications often
require larger storage capacity and higher I/O performance.
To supply needed capacity and/or bandwidth, one solu- 
tion is to add new disks to a RAID volume. This disk 
addition is termed “RAID scaling”. 

To regain uniform data distribution in all disks including
the old and the new, RAID scaling requires certain
blocks to be moved onto added disks. Furthermore, in
today’s server environments, many applications (e.g.,
e-business, scientific computation, and web servers) access
data constantly. The cost of downtime is extremely high
[2], giving rise to the necessity of online and real-time
scaling.

Traditional approaches [3, 4, 5] to RAID scaling
are restricted by preserving the round-robin order after
adding disks. The addressing algorithm can be expressed
as follows for the i-th scaling operation:


$$f_i(x): \begin{cases} d = x \bmod N_i \\ b = x / N_i \end{cases} \qquad (1)$$


where block b of disk d is the location of logical block x,
and N_i gives the total number of disks. Generally speaking,
as far as RAID scaling from m disks to m+n is
concerned, only the data blocks in the first stripe are not
moved. This indicates that almost 100 percent of data
blocks have to be migrated no matter what the numbers
of old disks and new disks are. There are some efforts
[3, 5] concentrating on optimization of data migration.
They improve the performance of RAID scaling by a certain
degree, but do not overcome the limitation of large
data migration completely.

The most intuitive method to reduce data migration is
the semi-RR [6] algorithm. It requires a block movement
only if the resulting disk number is one of the new disks. The
algorithm can be expressed as follows for the i-th scaling
operation:


$$g_i(x) = \begin{cases} g_{i-1}(x) & \text{if } (x \bmod N_i) < N_{i-1} \\ f_i(x) & \text{otherwise} \end{cases} \qquad (2)$$


Semi-RR reduces data migration significantly. Unfortu- 
nately, it does not guarantee uniform distribution of data 
blocks after subsequent scaling operations (see section 
2.4). This will deteriorate the initial equally distributed 
load. 
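
The contrast between the two addressing schemes can be sketched in a few lines. This is an illustration of Equations (1) and (2) as described above, with hypothetical function names; for semi-RR it tracks only the disk number, since the excerpt does not specify how the physical block number is chosen after a move.

```python
from collections import Counter


def round_robin(x, n_disks):
    """Equation (1): (disk, physical block) of logical block x on n_disks disks."""
    return x % n_disks, x // n_disks


def semi_rr_disk(x, history):
    """Equation (2): disk of block x after the scaling steps in `history`.

    history[i] is N_i, the disk count after the i-th scaling operation
    (history[0] is the initial count).
    """
    i = len(history) - 1
    if i == 0:
        return x % history[0]
    if x % history[i] < history[i - 1]:
        return semi_rr_disk(x, history[:-1])   # block stays where it was
    return x % history[i]                      # block moves to a new disk


def moved_fraction_round_robin(m, n, blocks):
    """Fraction of blocks whose location changes when scaling from m to m+n disks."""
    moved = sum(round_robin(x, m) != round_robin(x, m + n) for x in range(blocks))
    return moved / blocks


def moved_fraction_semi_rr(m, n, blocks):
    """Fraction of blocks whose disk changes under semi-RR for the same scaling."""
    moved = sum(semi_rr_disk(x, [m]) != semi_rr_disk(x, [m, m + n])
                for x in range(blocks))
    return moved / blocks


if __name__ == "__main__":
    print(moved_fraction_round_robin(3, 2, 1500))   # nearly everything moves
    print(moved_fraction_semi_rr(3, 2, 1500))       # only n/(m+n) moves
    # Two successive scalings, 2 -> 3 -> 4 disks, over one 12-block period:
    print(Counter(semi_rr_disk(x, [2, 3, 4]) for x in range(12)))
```

For the 2 → 3 → 4 history, the 12 blocks of one period land on the four disks as 4, 2, 3, and 3 blocks, a small instance of the imbalance that appears after a second scaling operation.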

In this paper, we propose a novel approach called 
FastScale to redistribute data for RAID-0 scaling. It ac- 
celerates RAID-0 scaling by minimizing data migration. 


FAST 11: 9th USENIX Conference on File and Storage Technologies 


[Figure 1: diagram of old disks D0 ... Dm-1 and new disks Dm ... Dm+n-1, with levels marked 1/m and 1/(m+n).]


Figure 1: Data migration using FastScale. Only data blocks are moved
from old disks to new disks for regaining a uniform distribution, while
no data is migrated among old disks.


As shown in Figure 1, FastScale moves only enough data
blocks from old disks to new disks to preserve the
uniformity of data distribution, while not migrating data
among old disks. Consequently, the migration fraction of
FastScale reaches the lower bound of the migration
fraction, n/(m+n). In other words, FastScale succeeds in
minimizing data migration for RAID scaling.

We design an elastic addressing function through 
which the location of one block can be easily computed 
without any lookup operation. By using this function, 
FastScale changes only a fraction of the data layout while 
preserving the uniformity of data distribution. FastScale 
has several unique features as follows: 


• FastScale maintains a uniform data distribution after
RAID scaling.


• FastScale minimizes the amount of data to be
migrated.


• FastScale preserves a simple management of data
due to deterministic placement.


• FastScale can sustain the above three features after
multiple disk additions.


FastScale also exploits special physical properties to 
optimize online data migration. First, it uses aggre- 
gate accesses to improve the efficiency of data migration. 
Second, it records data migration lazily to minimize the 
number of metadata updates while ensuring data consis- 
tency. 

We implement a detailed simulator that uses DiskSim 
as a worker module to simulate disk accesses. Under sev- 
eral real-system workloads, we evaluate the traditional 
approach and the FastScale approach. The experimental 
results demonstrate that: 


• Compared with one of the most efficient traditional
approaches, FastScale shortens redistribution time
by up to 86.06% with smaller maximum response
time of user I/Os.


• The performance of the RAID scaled using
FastScale is almost identical with that of the
round-robin RAID.


In this paper, we only describe our solution for RAID-0,
i.e., striping without parity. The solution can also work
for RAID-10 and RAID-01. Although we do not handle
RAID-4 and RAID-5, we believe that our method
provides a good starting point for efficient scaling of
RAID-4 and RAID-5 arrays.


2 Minimizing Data Migration 


2.1 Problem Statement 


For disk addition into a RAID, it is desirable to ensure an 
even load on all the disks and minimal block movement. 
Since the location of a block may be changed during a 
scaling operation, another objective is to quickly com- 
pute the current location of a block. 

To achieve the above objectives, the following three 
requirements should be satisfied for RAID scaling: 


• Requirement 1 (Uniform data distribution): If there
are B blocks stored on m disks, the expected number
of blocks on each disk is approximately B/m so as
to maintain an even load.


• Requirement 2 (Minimal data migration): During
the addition of n disks to a RAID with m disks
storing B blocks, the expected number of blocks to be
moved is B × n/(m+n).


• Requirement 3 (Fast data addressing): In an m-disk
RAID, the location of a block is computed by an
algorithm with low space and time complexity.


2.2 Two Examples of RAID Scaling 


Example 1: To understand how the FastScale algorithm 
works and how it satisfies all of the three requirements, 
we take RAID scaling from 3 disks to 5 as an example. 
As shown in Figure 2, one RAID scaling process can be 
divided into two stages logically: data migration and data 
filling. In the first stage, a fraction of existing data blocks 
are migrated to new disks. In the second stage, new data 
are filled into the RAID continuously. Actually, the two 
stages, data migration and data filling, can be overlapped 
in time. 

For the RAID scaling, each 5 sequential locations in
one disk are grouped into one segment. For the 5 disks,
5 segments with the same physical address are grouped
into one region. In Figure 2, different regions are
separated with a wavy line. For different regions, the ways of
data migration and data filling are exactly identical.

Figure 2: RAID scaling from 3 disks to 5 using FastScale, where m > n.

In a region, all of the data blocks within a parallelo- 
gram will be moved. The base of the parallelogram is 
2, and the height is 3. In other words, 2 data blocks 
are selected from each old disk and migrated to new 
disks. The 2 blocks are sequential, and the start address 
is disk_no. Figure 2 depicts the moving trace of each 
migrating block. For one moving data block, only its 
physical disk number is changed while its physical block 
number is unchanged. As a result, the five columns of 
two new disks will contain 1, 2, 2, 1, and 0 migrated data 
blocks, respectively. Here, the data block in the first col- 
umn will be placed upon disk 3, while the data block in 
the fourth column will be placed upon disk 4. The first 
blocks in columns 2 and 3 are placed on disk 3, and the 
second blocks in columns 2 and 3 are placed on disk 4. 
Thus, each new disk has 3 data blocks. 

After data migration, each disk, either old or new, has 
3 data blocks. That is to say, FastScale regains a uni- 
form data distribution. The total number of data blocks 
to be moved is 2 x 3 = 6. This reaches the minimal num- 
ber of moved blocks, (5 x 3) x (2/(3+2)) = 6. We can 
claim that the RAID scaling using FastScale can satisfy 
Requirement 1 and Requirement 2.

Figure 3: RAID scaling from 2 disks to 5 using FastScale, where m < n.

Let us examine whether FastScale can satisfy Requirement 3,
i.e., fast data addressing. To consider how one
logical data block is addressed, we divide all the data
space in the RAID into three categories: original and
unmoved data, original and migrated data, and new data. A
conclusion can be drawn from the following description
that the calculation overhead for the data addressing is
very low.


• The original and unmoved data can be addressed
with the original addressing method. In this example,
the ordinal number of the disk that holds block x
can be calculated as d = x mod 3. Its physical block
number can be calculated as b = x/3.


• The addressing method for original and migrated
data can be obtained easily from the above description
of the trace of the data migration: b = x/3.
For those blocks in the first triangle, i.e., blocks 0,
3, and 4, we have d = d0 + 3. For those blocks in
the last triangle, i.e., blocks 7, 8, and 11, we have
d = d0 + 2. Here, d0 is their original disk.


• Each region can hold 5 × 2 = 10 new blocks. In
one region, how those new data blocks are placed
is shown in Figure 2. If block x is a new block, it
is the y-th new block, where y = x − 3 × 11. Each
stripe holds 2 new blocks. So, we have b = y/2.
The first two new blocks in each region are placed
on Blocks 0 of Disk 0 and 4. For the other blocks,
d = (y mod 2) + (b mod 5) − 1.
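
The migration step of Example 1 can be replayed for a single region. The sketch below is specific to the 3-to-5 case and only encodes what the example states: a block at physical block b on old disk d0 moves when d0 ≤ b ≤ d0 + 1, and its destination is d0 + 3 in the first triangle or d0 + 2 in the last triangle. Distinguishing the two triangles with the test `b <= 1` is an encoding chosen here for this example, not the general `Moving` function, and the function name is illustrative.

```python
M, N = 3, 2          # old and new disk counts in Example 1
REGION = M + N       # physical blocks per disk in one region


def migrate_region():
    """Return (layout, moved) for the first region of the 3-to-5 example.

    layout maps (disk, physical block) -> logical block after migration;
    moved lists the logical blocks that changed disks.
    """
    layout, moved = {}, []
    for x in range(M * REGION):            # the 15 original blocks of the region
        d0, b = x % M, x // M              # round-robin location before scaling
        if d0 <= b <= d0 + N - 1:          # inside the parallelogram
            d = d0 + 3 if b <= 1 else d0 + 2   # first vs. last triangle
            moved.append(x)
        else:
            d = d0
        assert (d, b) not in layout        # no two blocks share a location
        layout[(d, b)] = x
    return layout, moved


if __name__ == "__main__":
    layout, moved = migrate_region()
    print("moved blocks:", moved)
    per_disk = [sum(1 for (d, _) in layout if d == disk) for disk in range(M + N)]
    print("blocks per disk:", per_disk)
```

Each disk ends up with 3 of the 15 blocks, and only 6 blocks move, which is the minimum B × n/(m+n) for this region.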


Example 2: In the above example, the number of the 
old disks m and the number of the new disks n satisfy the 
condition: m > n. In the following, we inspect the case 
when m < n. Take RAID scaling from 2 disks to 5 as an
example. Here, m = 2 and n = 3.

Likewise, in a region, all of the data blocks within
a parallelogram will be moved. The base of the
parallelogram is 3, and the height is 2. Three consecutive data
blocks are selected from each old disk and migrated to
new disks. Figure 3 depicts the trace of each migrating
block. Similarly, for one moving data block, only its
physical disk number is changed while its physical block
number is unchanged. As a result, five columns of three
new disks will have a different number of existing data
blocks: 1, 2, 2, 1, 0. Here, the data block in the first
column will be placed upon disk 3, while the data block
in the fourth column will be placed upon disk 4. Unlike
the first example, the first blocks in columns 2 and 3 are
placed on disks 2 and 3, respectively. Thus, each new
disk has 2 data blocks.

Similar to the first example, we can demonstrate that
the RAID scaling using FastScale can satisfy the three
requirements.


Algorithm: Addressing(t, H, s, x, d, b)
Input:
  t: scaling times
  H: scaling history, H[0], ..., H[t]
  s: total number of data blocks in one disk
  x: logical block number
Output:
  d: the disk holding block x
  b: physical block number

 1: if t = 0 then
 2:   m ← H[0], d ← x mod m, b ← x / m
 3:   exit
 4: m ← H[t−1], n ← H[t] − m, δ ← m − H[0]
 5: if x ∈ [0, m × s − 1]   // an original data block
 6:   Addressing(t−1, H, s, x, d0, b0)
 7:   b1 ← (b0 − δ) mod (m + n)
 8:   if b1 ∈ [d0, d0 + n − 1]   // to be moved
 9:     d ← Moving(d0, b1, m, n), b ← b0
10:   else   // not moved
11:     d ← d0, b ← b0
12: else   // a new data block
13:   Placing(x, m, n, s, δ, d, b)


Table 1: The addressing algorithm used in FastScale.


2.3 The Addressing Algorithm 


Table 1 shows the algorithm to minimize data migration
required by RAID scaling. The array H records the
history of RAID scaling. H[0] is the initial number of disks
in the RAID. After the i-th scaling operation, the RAID
consists of H[i] disks.

When a RAID is constructed from scratch (i.e., t = 0),
it is simply a round-robin RAID. The address of block x
can be calculated via one division and one modulo
operation (line 2).

Let us inspect the t-th scaling, where n disks are added
to a RAID made up of m disks (line 4).

(1) If block x is an original block (line 5), FastScale 
Function: Moving(d_0, b_1, m, n)
Input:
  d_0: the disk of the original location
  b_1: the original location in a region
  m: the number of old disks
  n: the number of new disks
Output:
  return value: new disk holding the block

 1: if m ≥ n
 2:     if b_1 ≤ n − 1
 3:         return d_0 + m
 4:     if b_1 ≥ m − 1
 5:         return d_0 + n
 6:     return m + n − 1 − (b_1 − d_0)
 7: if m < n
 8:     if b_1 ≤ m − 1
 9:         return d_0 + m
10:     if b_1 ≥ n − 1
11:         return d_0 + n
12:     return d_0 + b_1 + 1

Table 2: The Moving function.


calculates its old address (d_0, b_0) before the t-th scaling
(line 6).


• If (d_0, b_0) needs to be moved, FastScale changes the
disk ordinal number while keeping the block ordinal
number unchanged (line 9).


• If (d_0, b_0) does not need to be moved, FastScale
keeps both the disk ordinal number and the block
ordinal number unchanged (line 11).


(2) If block x is a new block, FastScale places it via 
the Placing() procedure (line 13). 

The code of line 8 decides whether a data
block (d_0, b_0) will be moved during RAID scaling. As
shown in Figures 2 and 3, there is a parallelogram in each
region. The base of the parallelogram is n, and the height
is m. A data block is moved if and only if it lies within
the parallelogram. The part of the parallelogram that falls
on disk d_0 is a line segment whose beginning and end are
d_0 and d_0 + n − 1, respectively. If b_1 is within this line
segment, block x is within the parallelogram, and therefore it
will be moved. After a RAID scaling that adds n disks,
the upper-left vertex of the parallelogram advances by n
blocks (line 7).

Once a data block is determined to be moved, 
FastScale changes its disk ordinal number with the Mov- 
ing() function. As shown in Figure 4, a migrating par- 
allelogram is divided into three parts: a head triangle, 
a body parallelogram, and a tail triangle. How a block 
moves depends on which part it lies in. No matter which 
is bigger between m and n, the head triangle and the tail 


Procedure: Placing(x, m, n, s, δ, d, b)
Input:
  x: logical block number
  m: the number of old disks
  n: the number of new disks
  s: total number of data blocks in one disk
  δ: offset of the first region
Output:
  d: new disk holding the block
  b: physical block of new location

 1: y ← x − m × s
 2: b ← y/n, row ← y mod n
 3: e ← (b − δ) mod (m + n)
 4: if e < n
 5:     if row < e + 1
 6:         d ← row
 7:     else
 8:         d ← row + m
 9: else
10:     d ← row + e − n + 1

Table 3: The procedure to place new data.


triangle keep their shapes unchanged. The head triangle
is moved by m disks (lines 3 and 9), while the tail
triangle is moved by n disks (lines 5 and 11). However,
the body is sensitive to the relationship between m and n.
The body is twisted from a parallelogram to a rectangle
when m > n (line 6), and from a rectangle to a
parallelogram when m < n (line 12). FastScale keeps the relative
locations of all blocks in the same column.

When block x is in a location newly added by the
last scaling, it is addressed via the Placing() procedure.
If block x is a new block, it is the y-th new block (line 1).
Each stripe holds n new blocks, so b = y/n
(line 2). The order of placing new blocks is shown in
Figures 2 and 3 (lines 4-10).

This algorithm is very simple. It requires fewer than 
50 lines of C code, reducing the likelihood that a bug will 
cause a data block to be mapped to the wrong location. 
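
The three routines in Tables 1 to 3 can be transcribed almost line for line. The following Python sketch is one such transcription, not the authors' C code. The function names, the tuple return values, and the `layout` helper are assumptions made here for illustration. It relies on Python's `%` returning a non-negative result for `(b - delta) mod (m + n)`; a C port would need to handle negative operands explicitly.

```python
def moving(d0, b1, m, n):
    """Table 2: new disk for the block at row b1 (within its region) of old disk d0."""
    if m >= n:
        if b1 <= n - 1:                  # head triangle: shifted by m disks
            return d0 + m
        if b1 >= m - 1:                  # tail triangle: shifted by n disks
            return d0 + n
        return m + n - 1 - (b1 - d0)     # body: parallelogram to rectangle
    if b1 <= m - 1:                      # head triangle
        return d0 + m
    if b1 >= n - 1:                      # tail triangle
        return d0 + n
    return d0 + b1 + 1                   # body: rectangle to parallelogram


def placing(x, m, n, s, delta):
    """Table 3: (disk, physical block) of a block added by the last scaling."""
    y = x - m * s                        # x is the y-th new block
    b, row = divmod(y, n)                # each stripe holds n new blocks
    e = (b - delta) % (m + n)            # row position inside the region
    if e < n:
        d = row if row < e + 1 else row + m
    else:
        d = row + e - n + 1
    return d, b


def addressing(t, H, s, x):
    """Table 1: (disk, physical block) of logical block x after t scalings."""
    if t == 0:
        m = H[0]
        return x % m, x // m
    m, n, delta = H[t - 1], H[t] - H[t - 1], H[t - 1] - H[0]
    if x < m * s:                        # an original data block
        d0, b0 = addressing(t - 1, H, s, x)
        b1 = (b0 - delta) % (m + n)
        if d0 <= b1 <= d0 + n - 1:       # inside the parallelogram: moved
            return moving(d0, b1, m, n), b0
        return d0, b0                    # not moved
    return placing(x, m, n, s, delta)    # a new data block


def layout(H, s):
    """Hypothetical helper: map every logical block of the final array to its slot."""
    t = len(H) - 1
    return {x: addressing(t, H, s, x) for x in range(H[t] * s)}
```

For the 3-to-5 example, `layout([3, 5], 10)` places blocks 0, 3, and 7 on disk 3 and blocks 4, 8, and 11 on disk 4, which agrees with the triangles described in Example 1. Checking that the returned slots are pairwise distinct is a quick way to confirm that the mapping is one-to-one.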


2.4 Property Examination 


The purpose of this experiment is to quantitatively
characterize whether the FastScale algorithm satisfies the
three requirements described in Subsection 2.1. For this
purpose, we compare FastScale with the round-robin
algorithm and the semi-RR algorithm. Starting from a 4-disk
array, we add one disk at a time, 10 times, using each of the
three algorithms. Each disk has a capacity of 128
GB, and the size of a data block is 64 KB. In other words,
each disk holds 2 × 1024² blocks.

Uniform data distribution. We use the coefficient of 
variation as a metric to evaluate the uniformity of data 


Figure 4: The variation of data layout involved in migration.


distribution across all the disks. The coefficient of
variation expresses the standard deviation as a percentage of
the average. The smaller the coefficient of variation,
the more uniform the data distribution. Figure 5 plots
the coefficient of variation versus the number of scaling
operations. For the round-robin and FastScale
algorithms, the coefficient of variation remains at 0 percent
as the number of disk additions increases.
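
The metric itself is a one-line computation. The sketch below assumes the population standard deviation over per-disk block counts; the text does not say whether the population or the sample form was used, and the function name is illustrative.

```python
from statistics import mean, pstdev


def coefficient_of_variation(blocks_per_disk):
    """Standard deviation of the per-disk block counts as a percentage of their mean."""
    return 100.0 * pstdev(blocks_per_disk) / mean(blocks_per_disk)
```

A perfectly even layout, such as the same block count on every disk, yields 0 percent, which is the value reported for round-robin and FastScale.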

Conversely, the semi-RR algorithm causes excessive
oscillation in the coefficient of variation, with a maximum
of 13.06 percent. The reason for this non-uniformity
is as follows. With an initial group of 4 disks, the
blocks are placed in a round-robin fashion. When the first
scaling operation adds one disk, 1/5 of all blocks,
those with (x mod 5) ≥ 4, are moved onto the new disk, Disk
4. However, when another disk is added
using the same approach, the 1/6 of all blocks that move
onto the new disk, Disk 5, are not evenly picked from the
5 old disks. Only certain blocks from Disks 1, 3,
and 4 are moved onto Disk 5, while Disks 0 and 2 are
ignored. This is because Disk 5 will contain blocks with
logical numbers that satisfy (x mod 6) = 5, which are
all odd numbers. The logical numbers of the blocks
on Disks 0 and 2, which satisfy (x mod 4) = 0 and
(x mod 4) = 2 respectively, are all even numbers.
Therefore, blocks from Disks 0 and 2 do not qualify and are not
moved.
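
The parity argument above can be checked with a small simulation. This is a sketch of semi-RR as it is described in this paragraph only: after each scaling to N disks, a block whose (x mod N) falls on a newly added disk moves there, and every other block stays put. Generalizing that rule to additions of more than one disk, and both function names, are assumptions made here.

```python
from collections import Counter


def semi_rr_disk(x, history):
    """Disk of logical block x after the scalings in history (disk counts over time)."""
    d = x % history[0]
    for old, new in zip(history, history[1:]):
        if x % new >= old:       # block lands on one of the newly added disks
            d = x % new
    return d


def sources_of_newest_disk(history, blocks):
    """For blocks that end up on the newest disk, count which disk they came from."""
    newest = history[-1] - 1
    return Counter(semi_rr_disk(x, history[:-1])
                   for x in range(blocks)
                   if semi_rr_disk(x, history) == newest)
```

Calling `sources_of_newest_disk([4, 5, 6], 600)` returns counts only for disks 1, 3, and 4, so Disks 0 and 2 keep all of their blocks after the second addition.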

Minimal data migration. Figure 6 plots the migra- 
tion fraction (i.e., the fraction of data blocks to be mi- 
grated) versus the number of scaling operations. Using 
the round-robin algorithm, the migration fraction is con- 
stantly 100%. This will bring a very large migration cost. 

The migration fractions using the semi-RR algorithm 
and using FastScale are identical. They are significantly 
smaller than the migration fraction of using the round- 
Figure 5: Comparison in uniform data distribution


[Plot omitted. X-axis: Times of Disk Additions (0 to 11); series: Round-Robin, FastScale, Semi-RR.]

Figure 6: Comparison in data migration ratio


robin algorithm. Another obvious phenomenon is that
they decrease as the number of scaling
operations increases. The reason is as follows.
To make each new disk hold 1/(m+n)
of the total data, the semi-RR algorithm and FastScale
move n/(m+n) of the total data. m increases with the
number of scaling operations. As a result, the
percentage of new disks (i.e., n/(m+n)) decreases. Therefore,
the migration fractions using the semi-RR algorithm and
FastScale decrease.
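
As a worked instance of n/(m+n), the sketch below lists the fraction for each step of the experiment's setup (a 4-disk array grown one disk at a time, 10 times). The function name and signature are illustrative only.

```python
from fractions import Fraction


def migration_fractions(initial_disks, added_per_step, steps):
    """n/(m+n) for each scaling step, with m disks before the step and n disks added."""
    result, m = [], initial_disks
    for _ in range(steps):
        result.append(Fraction(added_per_step, m + added_per_step))
        m += added_per_step
    return result
```

With `migration_fractions(4, 1, 10)` the sequence runs from 1/5 for the first addition down to 1/14 for the tenth.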

Storage and calculation overheads. When a disk
array boots, it needs to obtain the RAID topology from
the disks. Table 4 shows the storage overheads of the three
algorithms. The round-robin algorithm depends only on
the total number of member disks, so its storage
overhead is one integer. The semi-RR and FastScale
algorithms depend on how many disks are added during each
scaling operation. If we scale a RAID t times, their
storage overheads are t integers. In practice, RAID scaling
is infrequent; it may be performed
every half year, or even less often. Consequently, the storage
overheads are very small.

Figure 7: Comparison in addressing time


Algorithm      Storage Overhead (integers)
round-robin    1
semi-RR        t
FastScale      t

Table 4: The storage overheads of different algorithms.


To quantitatively characterize the calculation
overheads, we run the different algorithms to calculate the
physical addresses for all data blocks on a scaled RAID. The
whole addressing process is timed, and then the average
addressing time for each block is calculated. The testbed
used in the experiment is an Intel Dual Core T9400 2.53
GHz machine with 4 GB of memory, running Windows 7
Enterprise Edition. Figure 7 plots the addressing
time versus the number of scaling operations.

The round-robin algorithm has a low calculation
overhead of about 0.014 μs. The calculation overheads of
the semi-RR and FastScale algorithms are close, and
both show an upward trend. Among the three
algorithms, FastScale has the largest overhead. Fortunately,
the largest addressing time using FastScale is 0.24 μs,
which is negligible compared to the milliseconds of disk I/O
time.


3 Optimizing Data Migration


The FastScale algorithm succeeds in minimizing data
migration for RAID scaling. In this section, we describe
FastScale’s optimizations to the process of data
migration.


3.1 Access Aggregation 


FastScale moves only data blocks from old disks to new
disks, while not migrating data among old disks. The
data migration will not overwrite any valid data. As a
result, data blocks may be moved in an arbitrary order.
Since disk I/O performs much better with large
sequential access, FastScale accesses multiple successive blocks
via a single I/O.
Figure 8: Aggregate reads for RAID scaling from 3 disks to 5. Multiple
successive blocks are read via a single I/O.


Figure 9: Aggregate writes for RAID scaling from 3 disks to 5. Multiple
successive blocks are written via a single I/O.


Take a RAID scaling from 3 disks to 5 as an
example, shown in Figure 8. Let us focus on the first region.
FastScale issues the first I/O request to read Blocks 0 and
3, the second request to read Blocks 4 and 7, and the
third request to read Blocks 8 and 11, simultaneously. In
this way, FastScale reads all of these blocks with
only three I/Os instead of six. Furthermore, these three
large reads go to three different disks, so they can be done
in parallel, further increasing the I/O rate.

When all the six blocks have been read into a mem- 
ory buffer, FastScale issues the first I/O request to write 
Blocks 0, 3, and 7, the second I/O to write Blocks 4, 8 
and 11, simultaneously (see Figure 9). In this way, only 
two large sequential write requests are issued as opposed 
to six small writes. 

For RAID scaling from m disks to m+n, m reads and
n writes are required to migrate all the data in a region,
i.e., m × n data blocks.
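
The grouping behind these counts can be sketched as follows. The `MOVES` table is hand-built from the 3-to-5 example (old disk x mod 3, physical block x/3, and the new disks given for the two triangles); the `aggregate` helper and its `side` argument are hypothetical names, not part of FastScale.

```python
from collections import defaultdict

# First region of the 3-to-5 example: block -> (old disk, new disk, physical block).
MOVES = {0: (0, 3, 0), 3: (0, 3, 1), 4: (1, 4, 1),
         7: (1, 3, 2), 8: (2, 4, 2), 11: (2, 4, 3)}


def aggregate(moves, side):
    """Group moved blocks by disk and merge runs of consecutive physical blocks
    into single requests. side=0 groups by old disk (reads), side=1 by new disk (writes)."""
    per_disk = defaultdict(list)
    for block, entry in moves.items():
        per_disk[entry[side]].append((entry[2], block))
    requests = []
    for disk in sorted(per_disk):
        run = []
        for phys, block in sorted(per_disk[disk]):
            if run and phys != run[-1][0] + 1:   # gap: close the current request
                requests.append((disk, [b for _, b in run]))
                run = []
            run.append((phys, block))
        requests.append((disk, [b for _, b in run]))
    return requests
```

Here `aggregate(MOVES, 0)` yields three read requests, one per old disk, and `aggregate(MOVES, 1)` yields two write requests, one per new disk, matching the m reads and n writes stated above for m = 3 and n = 2.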

Access aggregation converts sequences of small
requests into fewer, larger requests. As a result, seek cost
is amortized over multiple blocks. Moreover, a typical
choice of the optimal block size for RAID is 32KB or 
64KB [4, 7, 8, 9]. Thus, accessing multiple successive 
blocks via a single I/O enables FastScale to have a larger 
throughput. Since data densities in disks increase at a 
much faster rate than improvements in seek times and ro- 
tational speeds, access aggregation benefits more as tech- 
nology advances. 


3.2 Lazy Checkpoint 


While data migration is in progress, the RAID storage
serves user requests. Furthermore, incoming user I/Os
may be write requests to migrated data. As a result,
if mapping metadata does not get updated until all of
the blocks have been moved, data consistency may be
destroyed. Ordered operations [9] of copying a data
block and updating the mapping metadata (a.k.a.,
checkpoint) can ensure data consistency. But ordered
operations cause each block movement to require one
metadata write, which results in a large cost of data
migration. Because metadata is usually stored at the beginning
of all member disks, each metadata update causes one
long seek per disk. FastScale uses lazy checkpoint to
minimize the number of metadata writes without
compromising data consistency.


Figure 10: If data blocks are copied to their new locations and
metadata is not yet updated when the system fails, data consistency is still
maintained because the data in their original locations are valid and
available.


Figure 11: Lazy updates of mapping metadata. “C”: migrated and
checkpointed; “M”: migrated but not checkpointed; “U”: not migrated.
Data redistribution is checkpointed only when a user write request
arrives in the area “M”.

The foundation of lazy checkpoint is described as fol- 
lows. Since block copying does not overwrite any valid 
data, both its new replica and original are valid after a 
data block is copied. In the above example, we suppose 
that Blocks 0, 3, 4, 7, 8, and 11 have been copied to their 
new locations and the mapping metadata has not been up- 
dated (see Figure 10), when the system fails. The origi- 
nal replicas of the six blocks will be used after the system 
reboots. As long as Blocks 0, 3, 4, 7, 8, and 11 have not 
been written since being copied, the data remain consis- 
tent. Generally speaking, when the mapping information 
is not updated immediately after a data block is copied, 
an unexpected system failure only wastes some data
accesses, but does not sacrifice data reliability. The only
threat comes from incoming write operations to migrated
data.

The key idea behind lazy checkpoint is that data blocks
are copied to new locations continuously, while the
mapping metadata is not updated onto the disks (a.k.a.,
checkpoint) until a threat to data consistency appears. We use
h_i(x) to describe the geometry after the i-th scaling
operation, where N_i disks serve user requests. Figure 11
illustrates an overview of the migration process. Data in the
moving region is copied to new locations. When a user
request arrives, if its physical block address is above the
moving region, it is mapped with h_{i−1}(x); if its physical
block address is below the moving region, it is mapped
with h_i(x). When all of the data in the current moving
region have been moved, the next region becomes the moving
region. In this way, the newly added disks gradually become
available to serve user requests. Mapping metadata is
updated only when a user write request arrives in the area
where data have been moved and the movement has not
been checkpointed.


Figure 12: Simulation system block diagram: The workload generator
and the array controller were implemented in SimPy. DiskSim was
used as a worker module to simulate disk accesses.


Since one write of metadata can store multiple map 
changes of data blocks, lazy updates can significantly 
decrease the number of metadata updates, reducing the 
cost of data migration. Furthermore, lazy checkpoint 
can guarantee data consistency. Even if the system fails 
unexpectedly, only some data accesses are wasted. It 
should also be noted that the probability of a system fail- 
ure is very low. 
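
The bookkeeping that lazy checkpoint implies can be sketched with two sets that mirror the “C” and “M” areas of Figure 11. This is a minimal model under stated assumptions, not FastScale's implementation: the class and method names are invented here, blocks are tracked individually rather than by region, and a checkpoint is assumed to complete before the triggering write is applied.

```python
class LazyCheckpoint:
    """Tracks which migrated blocks are covered by the on-disk mapping metadata."""

    def __init__(self):
        self.checkpointed = set()    # "C": migrated and checkpointed
        self.pending = set()         # "M": migrated but not checkpointed
        self.metadata_writes = 0

    def migrate(self, block):
        """Copy a block to its new location; the original replica stays valid."""
        self.pending.add(block)

    def checkpoint(self):
        """One metadata write records every pending map change at once."""
        if self.pending:
            self.checkpointed |= self.pending
            self.pending.clear()
            self.metadata_writes += 1

    def user_write(self, block):
        """Only a write into area "M" forces a checkpoint."""
        if block in self.pending:
            self.checkpoint()

    def replica_after_reboot(self, block):
        """Which replica the on-disk metadata would select after a failure."""
        return "new" if block in self.checkpointed else "old"
```

Migrating Blocks 0, 3, 4, 7, 8, and 11 and then writing to Block 7 costs a single metadata write in this model, whereas ordered operations would have issued six.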


4 Experimental Evaluation 


The experimental results in Section 2.4 show that the
semi-RR algorithm causes extremely non-uniform data
distribution, which results in low I/O performance.
In this section, we compare FastScale with the SLAS
approach [5] through detailed experiments. SLAS,
proposed in 2007, preserves the round-robin order after
adding disks.


4.1 Simulation System


We use detailed simulations with several disk traces col- 
lected in real systems. The simulator is made up of a 
workload generator and a disk array (Figure 12). Ac- 
cording to trace files, the workload generator initiates an 
I/O request at the appropriate time so that a particular 
workload is induced on the disk array. 

The disk array consists of an array controller and stor- 
age components. The array controller is logically divided 
into two parts: an I/O processor and a data mover. The 
I/O processor, according to the address mapping, for- 
wards incoming I/O requests to the corresponding disks. 
The data mover reorganizes the data on the array. The 
mover uses an on/off logic to adjust the redistribution 
rate. Data redistribution is throttled on detection of high 
application workload. Otherwise, it performs continu- 
ously. 

The simulator is implemented in SimPy [10] and 
DiskSim [11]. SimPy is an object-oriented, process- 
based discrete-event simulation language based on stan- 
dard Python. DiskSim is an efficient, accurate disk sys- 
tem simulator from Carnegie Mellon University and has 
been extensively used in various research projects study- 
ing storage subsystem architectures. The workload gen- 
erator and the array controller are implemented in SimPy. 
Storage components are implemented in DiskSim. In 
other words, DiskSim is used as a worker module to sim- 
ulate disk accesses. The simulated disk specification is 
that of the 15,000-RPM IBM Ultrastar 36Z15 [12]. 


4.2 Workloads 


Our experiments use the following three real-system disk 
I/O traces with different characteristics. 


• TPC-C traced disk accesses of the TPC-C database
benchmark with 20 warehouses [13]. It was
collected with one client running 20 iterations.


• Fin is obtained from the Storage Performance
Council (SPC) [14, 15], a vendor-neutral standards
body. The Fin trace was collected from OLTP
applications running at a large financial institution. The
write ratio is high.


• Web is also from SPC. It was collected from a
system running a web search engine. The
read-dominated Web trace exhibits strong locality in
its access pattern.


4.3 Experiment Results 


4.3.1 The Scaling Efficiency 


Each experiment lasts from the beginning to the end of
data redistribution for RAID scaling. We focus on
comparing redistribution times and user I/O latencies when
different scaling programs are running in the background.


Figure 13: Performance comparison between FastScale and SLAS
under the Fin workload.

In all experiments, the sliding window size for SLAS
is set to 1024. Access aggregation in SLAS can improve
the redistribution efficiency. However, redistribution I/Os
that are too large will compromise the I/O performance
of applications. In our experiments, SLAS reads 8 data
blocks via an I/O request.

The purpose of our first experiment is to quantitatively
characterize the advantages of FastScale through a
comparison with SLAS. We conduct a scaling operation of
adding 2 disks to a 4-disk RAID, where each disk has
a capacity of 4 GB. Each approach runs with a
32KB stripe unit size under the Fin workload. The
threshold of rate control is set to 100 IOPS. This parameter setup
acts as the baseline for the later experiments, and
any change from it will be stated explicitly.

We collect the latencies of all user I/Os. We divide 
the I/O latency sequence into multiple sections accord- 
ing to I/O issuing time. The time period of each section 
is 100 seconds. Furthermore, we get a local maximum 
latency from each section. A local maximum latency is 
the maximum of I/O latencies in a section. Figure 13 
plots local maximum latencies using the two approaches 
as the time increases along the x-axis. It illustrates that 
FastScale demonstrates a noticeable improvement over 
SLAS in two metrics. First, the redistribution time using 
FastScale is significantly shorter than that using SLAS. 
They are 952 seconds and 6,830 seconds, respectively. 
In other words, FastScale has an 86.06% shorter
redistribution time than SLAS.
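
The local maximum latency defined above can be computed in one pass over the latency log. The sketch assumes each sample is a pair of issue time in seconds and latency in milliseconds; that record format and the function name are assumptions, since the simulator's log layout is not given.

```python
from collections import defaultdict


def local_maximum_latencies(samples, section_seconds=100.0):
    """samples: iterable of (issue_time_s, latency_ms).
    Returns {section index: maximum latency observed in that section}."""
    maxima = defaultdict(float)
    for issue_time, latency in samples:
        section = int(issue_time // section_seconds)
        maxima[section] = max(maxima[section], latency)
    return dict(maxima)
```

Plotting the returned values against the section index gives a curve of the kind shown in Figure 13.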

The main factor in FastScale’s shorter redistribution
time is the significant decline in the amount of
data to be moved. When SLAS is used, almost 100%
of data blocks have to be migrated. However, when
FastScale is used, only 33.3% of data blocks need to
be migrated. Another factor is the effective exploitation
of two optimization techniques: access aggregation
reduces the number of redistribution I/Os, and lazy checkpoint
minimizes metadata writes.


Figure 14: Cumulative distribution of I/O latencies during the data
redistributions by the two approaches under the Fin workload.

Second, local maximum latencies of SLAS are obvi- 
ously longer than those of FastScale. The global max- 
imum latency using SLAS reaches 83.12 ms while that 
using FastScale is 55.60 ms. This is because the redis- 
tribution I/O size using SLAS is larger than that using 
FastScale. For SLAS, the read size is 256 KB (8 blocks), 
and the write size is 192 KB (6 blocks). For FastScale, 
the read size is 64 KB (2 blocks), and the write size is 128 
KB (4 blocks). Of course, local maximum latencies of 
SLAS will be lower with a decrease in the redistribution 
I/O size. But the decrease in the I/O size will necessarily 
enlarge the redistribution time. 

Figure 14 shows the cumulative distribution of user re- 
sponse times during data redistribution. To provide a fair 
comparison, I/Os involved in statistics for SLAS are only 
those issued before 952 seconds. When I/O latencies are 
larger than 18.65 ms, the CDF value of FastScale is larger 
than that of SLAS. This indicates again that FastScale 
has smaller maximum response time of user I/Os than 
SLAS. The average latency of FastScale is close to that 
of SLAS. They are 8.01 ms and 7.53 ms respectively. It 
is noteworthy that due to significantly shorter data redis- 
tribution time, FastScale has a markedly smaller impact 
on the user I/O latencies than SLAS does. 

A factor that might affect the benefits of FastScale is
the workload under which data redistribution is performed.
We therefore also measure the performance of FastScale
and SLAS for the same “4+2” scaling operation under the
TPC-C workload.

For the TPC-C workload, Figure 15 shows local
maximum latencies versus the redistribution times for SLAS
and FastScale. It shows once again the efficiency of
FastScale in improving the redistribution time. The
redistribution times using SLAS and FastScale are 6,820
seconds and 964 seconds, respectively. That is to say,
FastScale brings an improvement of 85.87% in the
redistribution time. Likewise, local maximum latencies of
FastScale are also obviously shorter than those of SLAS.
The global maximum latency using FastScale is 114.76
ms while that using SLAS reaches 147.82 ms.


Figure 15: Performance comparison between FastScale and SLAS
under the TPC-C workload.

To compare the performance of FastScale under
different workloads, Figure 16 shows a comparison of the
redistribution time between FastScale and SLAS. For
completeness, we also conduct a comparison
experiment on the redistribution time with no loaded
workload. To scale a RAID volume offline, SLAS uses 6,802
seconds whereas FastScale consumes only 901 seconds.
FastScale provides an improvement of 86.75% in the
redistribution time.

We can draw one conclusion from Figure 16. Under
various workloads, FastScale consistently outperforms
SLAS by 85.87-86.75% in the redistribution time, with
a smaller maximum response time of user I/Os.
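
The three percentages follow directly from the reported redistribution times. The short sketch below reproduces the arithmetic; the dictionary simply restates the times quoted in this section, and the names are illustrative.

```python
def improvement_percent(slas_seconds, fastscale_seconds):
    """Relative reduction in redistribution time, in percent."""
    return 100.0 * (slas_seconds - fastscale_seconds) / slas_seconds


# Redistribution times reported in this section: (SLAS, FastScale), in seconds.
REPORTED = {"Fin": (6830, 952), "TPC-C": (6820, 964), "unloaded": (6802, 901)}

improvements = {name: round(improvement_percent(*times), 2)
                for name, times in REPORTED.items()}
```

The resulting values are 86.06 for Fin, 85.87 for TPC-C, and 86.75 for the unloaded case.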


4.3.2 The Performance after Scaling 


The above experiments show that FastScale improves the 
scaling efficiency of RAID significantly. One of our con- 
cerns is whether there is a penalty in the performance of 
the data layout after scaling using FastScale, compared 
with the round-robin layout preserved by SLAS. 

We use the Web workload to measure the perfor- 
mances of the two RAIDs, scaled from the same RAID 
using SLAS and FastScale. Each experiment lasts 500 
seconds, and records the latency of each I/O. Based on 
the issue time, the I/O latency sequence is divided into 
20 sections evenly. Furthermore, we get a local average 
latency from each section. 

First, we compare the performance of the two RAIDs
after one scaling operation “4+1” using the two scaling
approaches. Figure 17 plots local average latencies for
the two RAIDs as the time increases along the x-axis.
The performances of the two RAIDs are
very close. For the round-robin RAID, the
average latency is 11.36 ms. For the FastScale RAID,
the average latency is 11.37 ms.


Figure 16: Comparison of redistribution times of FastScale and SLAS
under different workloads. The label “unloaded” means scaling a
RAID volume offline.

Second, we compare the performance of the two RAIDs
after two scaling operations “4+1+1” using the two
approaches. Figure 18 plots local average latencies of the
two RAIDs as the time increases along the x-axis. It
again reveals the approximate equality in the
performances of the two RAIDs. For the round-robin
RAID, the average latency is 11.21 ms. For the
FastScale RAID, the average latency is 11.03 ms.

We conclude that the performance
of the RAID scaled using FastScale is almost identical
to that of the round-robin RAID.


5 Related Work 


5.1 Scaling Deterministic RAID 


The HP AutoRAID [8] allows an online capacity expan- 
sion. Newly created RAID-5 volumes use all of the disks 
in the system, but previously created RAID-5 volumes 
continue to use only the original disks. This expansion 
does not require data migration. But the system cannot 
add new disks into an existing RAID-5 volume. The
conventional approaches to RAID scaling redistribute data
and preserve the round-robin order after adding disks.

Gonzalez and Cortes [3] proposed a gradual assimila- 
tion algorithm (GA) to control the overhead of scaling a 
RAID-5 volume. However, GA accesses only one block 
via an I/O. Moreover, it writes mapping metadata onto 
disks immediately after redistributing each stripe. As a 
result, GA has a large redistribution cost. 

The reshape toolkit in the Linux MD driver
(MD-Reshape) [4] writes mapping metadata for each
fixed-sized data window. However, user requests to the data
window have to queue up until all data blocks within the
window are moved. In addition, MD-Reshape
issues very small (4KB) I/O operations for data
redistribution. This limits the redistribution performance due to
more disk seeks.


[Plot omitted. X-axis: timeline (s), 0 to 500; series: round-robin, FastScale.]

Figure 17: Performance comparison between FastScale’s layout and
round-robin layout under the Web workload after one scaling operation
“4+1”.

Zhang et al. [5] discovered that there is always a re- 
ordering window during data redistribution for round- 
robin RAID scaling. The data inside the reordering win- 
dow can migrate in any order without overwriting any 
valid data. By leveraging this insight, they proposed the 
SLAS approach, improving the efficiency of data redis- 
tribution. However, SLAS still requires migrating all 
data. Therefore, RAID scaling remains costly. 

D-GRAID [16] restores only live file system data to a
hot spare so as to recover from failures quickly.
Likewise, the redistribution process can be accelerated if only
the data blocks that are live from the perspective of file
systems are redistributed. However, this requires
semantically-smart storage systems. In contrast, FastScale is
independent of file systems, and it can work with any ordinary
disk storage.

A patent [17] presents a method to eliminate the need 
to rewrite the original data blocks and parity blocks on 
original disks. However, the method places all the parity
blocks either only on original disks or only on new
disks. This obvious non-uniformity in the distribution of
parity blocks penalizes write performance.

Franklin et al. [18] presented a RAID scaling method
using spare space with immediate access to new space. 
First, old data are distributed among the set of data disk 
drives and at least one new disk drive while, at the same 
time, new data are mapped to the spare space. Upon 
completion of the distribution, new data are copied from 
the spare space to the set of data disk drives. This is simi- 
lar to the key idea of WorkOut [19]. This kind of method 
requires spare disks available in the RAID. 

In another patent, Hetzler [20] presented a method for
RAID-5 scaling, called MDM. MDM exchanges some
data blocks between original disks and new disks. MDM
can perform RAID scaling with reduced data movement.
However, it does not increase (just maintains) the data
storage efficiency after scaling. The RAID scaling
process exploited by FastScale is favored in the art because
the data storage efficiency is maximized, which many
practitioners consider desirable.


Figure 18: Performance comparison between FastScale’s layout and
round-robin layout under the Web workload after two scaling
operations “4+1+1”.


5.2 Scaling Randomized RAID 


Randomized RAID [6, 21, 22, 23] appears to have better 
scalability and is now gaining attention in the data 
placement area. Brinkmann et al. [23] proposed the 
cut-and-paste placement strategy, which uses a randomized 
allocation strategy to place data across disks. For a disk 
addition, it cuts off the range [1/(n+1), 1/n] from the 
given n disks and pastes it onto the newly added (n+1)th 
disk. For a disk removal, it uses the reverse operation to 
move all the blocks on the disks to be removed to the 
other disks. Also based on random data placement, Seo 
and Zimmermann [24] proposed an approach to finding a 
sequence of disk additions and removals for the disk 
replacement problem. The goal is to minimize the data 
migration cost. Both approaches assume the existence of a 
high-quality hash function that maps all the data blocks 
in the system to uniformly distributed real numbers with 
high probability. However, neither presents such a hash 
function. 
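
The cut-and-paste rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation from [23]: the function name cut_and_paste_disk is hypothetical, each block is assumed to already have a hash value x in [0, 1), and the order in which the cut pieces are concatenated on the new disk is our own choice.

```python
from fractions import Fraction


def cut_and_paste_disk(x, n_disks):
    """Return the 0-based disk holding hash value x in [0, 1) on n_disks disks.

    With a single disk, the block sits on disk 0 at height x. Each time the
    array grows from n to n + 1 disks, the part of every disk above height
    1/(n+1) is cut off and pasted onto the new disk.
    """
    disk, height = 0, Fraction(x)
    for n in range(1, n_disks):
        keep = Fraction(1, n + 1)  # height that stays on each old disk
        if height >= keep:
            # Assumed paste order: pieces are stacked by source disk index.
            piece = Fraction(1, n * (n + 1))  # size of one cut piece
            height = disk * piece + (height - keep)
            disk = n
    return disk


# Spread 1200 evenly spaced hash values over 4 disks.
counts = [0] * 4
for k in range(1200):
    counts[cut_and_paste_disk(Fraction(k, 1200), 4)] += 1
print(counts)
```

Because the placement for n + 1 disks extends the computation for n disks by one step, a block either stays where it was or moves to the newly added disk, which is the property that keeps migration small.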

The SCADDAR algorithm [6] uses a pseudo-random 
function to distribute data blocks randomly across all 
disks. It keeps track of the locations of data blocks after 
multiple disk reorganizations and minimizes the amount 
of data to be moved. Unfortunately, the pseudo-random 
function does not preserve the randomness of the data 
layout after several disk additions or deletions [24]. So 
far, no truly randomized hash function that preserves its 
randomness after several disk additions or deletions has 
been found. 

The simulation report in [21] shows that a single copy 
of data in random striping may result in some hiccups of 
the continuous display. To address this issue, one can use 
data replication [22], where a randomly selected fraction 
of the data blocks is replicated on randomly selected 
disks. However, this incurs a large capacity overhead. 

RUSH [25, 26] and CRUSH [27] are two algorithms 
for online placement and reorganization of replicated 
data. They are probabilistically optimal in distributing 
data evenly and minimizing data movement when new 
storage is added to the system. There are three differ- 
ences between them and FastScale. First, they depend on 
the existence of a high-quality random function, which 
is difficult to generate. Second, they are designed for 
object-based storage systems. They focus on how a data 
object is mapped to a disk, without considering the data 
layout of each individual disk. Third, our mapping func- 
tion needs to be 1-1 and onto, but hash functions have 
collisions and count on some amount of sparseness. 


6 Conclusion and Future Work 


This paper presents FastScale, a new approach that ac- 
celerates RAID-0 scaling by minimizing data migra- 
tion. First, with a new and elastic addressing function, 
FastScale minimizes the number of data blocks to be mi- 
grated without compromising the uniformity of data dis- 
tribution. Second, FastScale uses access aggregation and 
lazy checkpoint to optimize data migration. 

Our results from detailed experiments using real- 
system workloads show that, compared with SLAS, a 
scaling approach proposed in 2007, FastScale can reduce 
redistribution time by up to 86.06% with smaller maxi- 
mum response time of user I/Os. The experiments also 
illustrate that the performance of the RAID scaled using 
FastScale is almost identical to that of the round-robin 
RAID. 

In this paper, the factor of data parity is not taken into 
account. We believe that FastScale provides a good 
starting point for efficient scaling of RAID-4 and RAID-5 
arrays. In the future, we will focus on extending FastScale 
to RAID-4 and RAID-5. 


7 Acknowledgements 


We are indebted to the anonymous reviewers of the pa- 
per for their insightful comments. We are also grate- 
ful to Dr. Benjamin Reed, our shepherd, for detailed 
comments and suggestions that greatly improved the 
readability of the paper. This work was supported by 
the National Natural Science Foundation of China un- 
der Grant 60903183, the National High Technology Re- 
search and Development Program of China under Grant 
No. 2009AA01A403, and the National Grand Funda- 
mental Research 973 Program of China under Grant No. 
2007CB311100. 


FAST ’11: 9th USENIX Conference on File and Storage Technologies 


References 


[1] D. A. Patterson, G. A. Gibson, R. H. Katz. A Case for Redundant 
Arrays of Inexpensive Disks (RAID). In Proceedings of the 
International Conference on Management of Data (SIGMOD’88), 
June 1988. pp. 109-116. 

[2] D. A. Patterson. A simple way to estimate the cost of down-time. 
In Proceedings of the 16th Large Installation Systems 
Administration Conference (LISA’02), October 2002. pp. 185-188. 

[3] J. Gonzalez and T. Cortes. Increasing the capacity of RAID5 by 
online gradual assimilation. In Proceedings of the International 
Workshop on Storage Network Architecture and Parallel I/Os. 
Antibes Juan-les-pins, France, Sept. 2004. 

[4] N. Brown. Online RAID-5 resizing. drivers/md/raid5.c in the 
source code of Linux Kernel 2.6.18. http://www.kernel.org/. 
September 2006. 

[5] G. Zhang, J. Shu, W. Xue, and W. Zheng. SLAS: An efficient 
approach to scaling round-robin striped volumes. ACM Trans. 
Storage, volume 3, issue 1, Article 3, 1-39 pages. March 2007. 

[6] A. Goel, C. Shahabi, S-YD Yao, R. Zimmermann. SCADDAR: 
An efficient randomized technique to reorganize continuous media 
blocks. In Proceedings of the 18th International Conference 
on Data Engineering (ICDE’02). San Jose, 2002. pp. 473-482. 

[7] J. Hennessy and D. Patterson. Computer Architecture: A 
Quantitative Approach, 3rd ed. Morgan Kaufmann Publishers, Inc., 
San Francisco, CA, 2003. 

[8] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP 
AutoRAID hierarchical storage system. ACM Transactions on 
Computer Systems, volume 14, issue 1, pp. 108-136, February 
1996. 

[9] C. Kim, G. Kim, and B. Shin. Volume management in SAN 
environment. In Proceedings of the 8th International Conference on 
Parallel and Distributed Systems, ICPADS’01. 2001. pp. 500-505. 

[10] Klaus Muller, Tony Vignaux. SimPy 2.0.1’s documentation. 
http://simpy.sourceforge.net/SimPyDocs/index.html. last 
accessed on April, 2009. 

[11] J. Bucy, J. Schindler, S. Schlosser, G. Ganger. The DiskSim 
Simulation Environment Version 4.0 Reference Manual. Tech. report 
CMU-PDL-08-101, Carnegie Mellon University. 2008. 

[12] Hard disk drive specifications: Ultrastar 36Z15. 
http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A7800739FEB/$file/U36Z15_sp10.PDF. 
Revision 1.0, April, 2001. 

[13] TPC-C. Postgres. 20 iterations. DTB v1.1. Performance 
Evaluation Laboratory, Brigham Young University. Trace 
distribution center. http://tds.cs.byu.edu/tds/, last accessed on 
December, 2010. 

[14] OLTP Application I/O and Search Engine I/O. UMass Trace 
Repository. http://traces.cs.umass.edu/index.php/Storage/Storage. 
June, 2007. 

[15] Storage Performance Council. 
http://www.storageperformance.org/home. last accessed 
on December, 2010. 

[16] Muthian Sivathanu, Vijayan Prabhakaran, Andrea C. 
Arpaci-Dusseau, Remzi H. Arpaci-Dusseau. Improving Storage System 
Availability with D-GRAID. In Proceedings of the 3rd USENIX 
Conference on File and Storage Technologies (FAST’04), San 
Francisco, CA. March 2004. 

[17] C.B. Legg, Method of Increasing the Storage Capacity of a Level 
Five RAID Disk Array by Adding, in a Single Step, a New Parity 
Block and N-1 New Data Blocks Which Respectively Reside in 
a new Columns, Where N Is at Least Two, US Patent: 6000010, 
December 1999. 

[18] C.R Franklin and J.T. Wong, Expansion of RAID Subsystems 
Using Spare Space with Immediate Access to New Space, US 
Patent 10/033,997, 2006. 

[19] Suzhen Wu, Hong Jiang, Dan Feng, Lei Tian, and Bo Mao, 
WorkOut: I/O Workload Outsourcing for Boosting the RAID 
Reconstruction Performance, In Proceedings of the 7th USENIX 
Conference on File and Storage Technologies (FAST ’09), San 
Francisco, CA, USA, pp. 239-252. February 2009. 

[20] S.R. Hetzler, Data Storage Array Scaling Method and System 
with Minimal Data Movement, US Patent 20080276057, 2008. 

[21] J. Alemany and J. S. Thathachar. Random striping news on 
demand servers. Tech. Report, TR-97-02-02, University of 
Washington, 1997. 

[22] Jose Renato Santos, Richard R. Muntz, and Berthier A. 
Ribeiro-Neto. Comparing random data allocation and data striping in 
multimedia servers. In Measurement and Modeling of Computer 
Systems, pp. 44-55. 2000. 

[23] Andre Brinkmann, Kay Salzwedel, and Christian Scheideler. 
Efficient, distributed data placement strategies for storage area 
networks (extended abstract). In ACM Symposium on Parallel 
Algorithms and Architectures, pp. 119-128. 2000. 

[24] Beomjoo Seo and Roger Zimmermann. Efficient disk replacement 
and data migration algorithms for large disk subsystems. 
ACM Transactions on Storage (TOS), volume 1, issue 3, pages 
316-345, August 2005. 

[25] R. J. Honicky and E. L. Miller. A fast algorithm for online 
placement and reorganization of replicated data. In Proceedings of the 
17th International Parallel and Distributed Processing Symposium 
(IPDPS 2003), Nice, France, April 2003. 

[26] R. J. Honicky and E. L. Miller. Replication under scalable 
hashing: A family of algorithms for scalable decentralized data 
distribution. In Proceedings of the 18th International Parallel and 
Distributed Processing Symposium (IPDPS’04), IEEE. 2004. 

[27] S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn. CRUSH: 
Controlled, Scalable, Decentralized Placement of Replicated 
Data. In Proceedings of the International Conference on Super 
Computing (SC’06). Tampa Bay, FL. 2006. 



The SCADS Director: Scaling a Distributed Storage System Under 
Stringent Performance Requirements 


Beth Trushkowsky, Peter Bodik, Armando Fox, 
Michael J. Franklin, Michael I. Jordan, David A. Patterson 
{trush, bodik, fox, franklin, jordan, pattrsn}@eecs.berkeley.edu 
University of California, Berkeley 


USENIX Association 


Abstract 


Elasticity of cloud computing environments provides an 
economic incentive for automatic resource allocation of 
stateful systems running in the cloud. However, these 
systems have to meet strict performance Service-Level 
Objectives (SLOs) expressed using upper percentiles of 
request latency, such as the 99th. Such latency measure- 
ments are very noisy, which complicates the design of 
the dynamic resource allocation. We design and evaluate 
the SCADS Director, a control framework that reconfig- 
ures the storage system on-the-fly in response to work- 
load changes using a performance model of the system. 
We demonstrate that such a framework can respond to 
both unexpected data hotspots and diurnal workload pat- 
terns without violating strict performance SLOs. 


1 Introduction 


Cloud computing has emerged as a preferred technology 
for delivering large-scale internet applications, in part be- 
cause its elasticity provides the ability to dynamically 
provision and reclaim resources in response to fluctua- 
tions in workload. As cloud environments and their ap- 
plications expand in scale and complexity, it becomes in- 
creasingly important to automate such dynamic resource 
allocation. 

Techniques for automatically scaling stateless systems 
such as web servers or application servers are fairly well 
understood. However, many applications that can most 
benefit from elasticity, such as social networking, e- 
commerce and auction sites, are both data-intensive and 
interactive. Such applications present three major chal- 
lenges for automatic scaling. 

First, in most data-intensive services, a request for a 
specific data item can only be satisfied by a copy of that 
particular data item, so not every server can handle every 
request, which complicates load balancing. Second, in- 
teractivity means that a successful application must pro- 
vide highly-responsive, low-latency service to the vast 
majority of users: a typical Service Level Objective 
(SLO) might be expressed as “99% of all requests must 
be answered within 100ms” [20, 17]. Third, the work- 
loads presented by large-scale applications can be highly 
volatile, with quickly-occurring unexpected spikes (due 
to flash crowds) and diurnal fluctuations. 

This “perfect storm” of statefulness, workload volatil- 
ity and stringent performance requirements complicates 
the development of automatic scaling mechanisms. To 
scale a data-intensive system, data items must be moved 
(i.e., partitioned or coalesced) or copied (i.e., replicated) 
among the nodes of the system. Such data movement 
takes time and can place additional load on an already 
overloaded system. Provisioning of new nodes incurs 
significant start-up delay, so decisions must be made 
early to react effectively to workload changes. But most 
importantly, the SLOs on upper percentile latency sig- 
nificantly complicate the problem compared to require- 
ments based on average latency, as statistical estimates 
based on observations in the upper percentiles of the la- 
tency distribution have higher variance than estimates 
obtained from the center of the distribution. This vari- 
ance is exacerbated by “environmental” application noise 
uncorrelated to particular queries or data items [19]. The 
resulting noisy latency signal can cause oscillations in 
classical closed-loop control [7]. 

In this paper we describe the design of a control frame- 
work for dynamically scaling distributed storage systems 
that addresses these challenges. Our approach leverages 
key features of modern distributed storage systems and 
uses a performance model coupled with workload statis- 
tics to predict whether each server is likely to continue to 
meet its SLO. Based on this model, the framework moves 
and replicates data as necessary. In particular, we make 
the following contributions: 


• We identify the challenges and opportunities that 
arise in designing dynamic resource allocation 
frameworks for stateful systems that maintain performance 
SLOs on upper quantiles of request latency. 

• We describe the design and implementation of 
a modular control framework based on Model-Predictive 
Control [30] that addresses these challenges. 

• We evaluate the effectiveness of the control framework 
through experiments using a storage system 
running on Amazon’s Elastic Compute Cloud (EC2), 
using workloads that exhibit both periodic and erratic 
fluctuations comparable to those observed in production 
systems. 


The rest of the paper proceeds as follows. Section 2 
describes background and challenges, and Section 3 dis- 
cusses the design considerations that address those chal- 
lenges. Related work is in Section 4. Section 5 details 
the implementation of our control framework, and Sec- 
tion 6 demonstrates experimental results of the control 
framework using Amazon’s EC2. Further discussion is 
in Section 7, and we remark on future work and conclude 
in Sections 8 and 9. 


2 Scaling Challenges 
2.1 Background 


We address dynamic resource allocation for distributed 
storage systems for which the performance SLO is spec- 
ified using an upper percentile of latency. The goal is to 
design a control framework that tries to avoid SLO vio- 
lations, while keeping the cost of leased resources low. 

Our solution is targeted for storage systems designed 
for horizontal scalability, such as key-value stores, that 
back interactive web applications. Examples of such 
systems are PNUTS [17], BigTable [14], Cassandra [3], 
SCADS [6], and HBase [4]. Requests in these systems 
have a simple communication pattern; each system at 
minimum provides get and put functionality on keys, 
and each request is a single unit of work. We take 
advantage of this simplicity in our approach. 

Figure 1: Standard deviation for the mean and 99th 
percentile of latency for increasing smoothing window 
sizes. The left-most points represent the raw measurements 
over 20-second periods. The average of the mean 
and 99th percentile latencies are 11 ms and 82 ms, 
respectively. 

This simplified model also lends itself to easy 
partitioning of the key space across multiple servers, 
typically using a hash or range partitioning scheme. Each 
server node stores a subset of the data and serves 
requests for that subset. The control framework has two 
knobs: it can partition or replicate data to prevent servers 
from being overloaded when workload increases (e.g. 
due to diurnal variation or hotspots), or it can coalesce 
data and remove unnecessary replicas when the workload 
decreases. To make these configuration changes, 
the underlying storage system must be easy to reconfigure 
on-the-fly. Specifically, we require that it allows 
data to be copied from one server to another or 
deleted from a server, and that it provides methods like 
AddServer and RemoveServer to alter the number 
of leased servers. We previously designed and built 
SCADS [6] to both support this functionality and provide 
the simple communication pattern described above. 
As we further discuss in Section 7, running our own 
key-value store in the cloud has advantages over using 
a cloud-provided data service such as Amazon’s S3. 
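
To make the range-partitioning and replication knobs concrete, the following Python sketch keeps a map from key ranges to the servers holding them. It is an illustration only: the class RangePartitionMap, its methods, and the server names are hypothetical and are not part of the SCADS API.

```python
import bisect


class RangePartitionMap:
    """Maps each key to the list of servers storing its key range."""

    def __init__(self, split_keys, replicas):
        # split_keys are sorted range boundaries; range i covers keys in
        # [split_keys[i-1], split_keys[i]), so there is one more range
        # than there are boundaries.
        assert len(replicas) == len(split_keys) + 1
        self.split_keys = list(split_keys)
        self.replicas = [list(servers) for servers in replicas]

    def range_index(self, key):
        return bisect.bisect_right(self.split_keys, key)

    def servers_for(self, key):
        return self.replicas[self.range_index(key)]

    def add_replica(self, index, server):
        # Corresponds to copying a range's data to another server.
        self.replicas[index].append(server)

    def remove_replica(self, index, server):
        # Corresponds to deleting a range's data from a server.
        self.replicas[index].remove(server)


layout = RangePartitionMap(["g", "n"], [["s1"], ["s2"], ["s3"]])
layout.add_replica(1, "s4")  # replicate the hot middle range
print(layout.servers_for("hello"))
```

A request for a key can then be served only by the servers returned for its range, which is why adding a server without copying data to it does not relieve load.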

SCADS was designed to keep data memory-resident 
so that applications aren’t required to use ad-hoc caching 
techniques to reach performance goals. This design pro- 
vides similar performance benefits as Memcached; how- 
ever, SCADS also supports real-time replication and load 
balancing. An example target application would be the 
highly interactive social networking site Facebook.com; 
most of their data remains memory-resident in order to 
hit performance targets [31]. 

In this section, we identify two challenges in scaling a 
storage system while maintaining a high-percentile SLO: 
noise and data movement. Benchmarks are presented to 
show the effects of each of these challenges. 


2.2 Controlling a Noisy Signal 


Figure 1 shows request latencies achieved by several 
key-value storage servers under a steady workload.¹ As 
expected, the standard deviation of both the mean latency 
and 99th percentile latency decreases as we increase the 
smoothing window, or time period over which the 
measurements are aggregated. However, as can be seen in 
the figure, the 99th percentile of latency would have 
to be smoothed over a four-minute window to achieve 
the same standard deviation as that achieved by the 
mean smoothed over a 20-second window (an 11x longer 
smoothing window). Similar effects are illustrated in 
experiments with Dynamo [20].² 

¹ The workload consists of get and put requests against the 
SCADS [6] storage system, running on ten Amazon Elastic Compute 
Cloud (EC2) “Small” instances. Details of our experimental setup are 
in Section 6.1. 

² See Figure 4 in [20]. 

Figure 2: Impact on read performance during data copying 
on the write target. The x-axis represents the copy 
rate (in log scale) and the y-axis represents the fraction 
of requests slower than 100 ms (in log scale). 

This observation has serious consequences if we are 
contemplating using classical closed-loop control. A 
long smoothing window means a longer delay before 
the control loop can make its next decision, resulting in 
more SLO violations. Furthermore, too much smoothing 
could mask a real spike in workload, and the controller 
would not respond at all. A short smoothing window 
mitigates both problems but can lead to oscillatory 
behavior [7]. Due to the high variance associated with a 
shorter smoothing window, the controller cannot tell if a 
server with high latency is actually overloaded or if it is 
simply exhibiting normally-occurring higher latency. A 
classical closed-loop controller might add servers in one 
iteration just to remove them in the next or may move 
data back and forth unnecessarily in response to such 
“false alarms.” We show in Section 3 that a more 
effective approach is a model-based control in which the 
controller uses a different input signal than the quantity 
it is trying to control. 
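
The gap in noise between the mean and the 99th percentile can be reproduced with a small simulation. The Python sketch below is not the benchmark behind Figure 1: it assumes independent, log-normally distributed latencies with arbitrarily chosen parameters, and measures windows in numbers of requests rather than seconds.

```python
import random
import statistics


def percentile(values, p):
    """Nearest-rank style percentile of a list of numbers."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[index]


def window_stddevs(latencies, window):
    """Std. deviation of the per-window mean and per-window 99th percentile."""
    means, p99s = [], []
    for start in range(0, len(latencies) - window + 1, window):
        chunk = latencies[start:start + window]
        means.append(statistics.fmean(chunk))
        p99s.append(percentile(chunk, 99))
    return statistics.stdev(means), statistics.stdev(p99s)


rng = random.Random(0)
# Synthetic latencies in ms; the distribution is an assumption.
latencies = [rng.lognormvariate(2.0, 0.8) for _ in range(200_000)]
for window in (1_000, 4_000, 16_000):
    sd_mean, sd_p99 = window_stddevs(latencies, window)
    print(f"window={window}: sd(mean)={sd_mean:.2f} sd(p99)={sd_p99:.2f}")
```

For the same window size, the 99th percentile fluctuates far more than the mean, so a controller driven by it needs a much longer window to see an equally stable signal.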


2.3 Data Movement Hurts Performance 


Scaling a storage system requires data movement. Be- 
cause each server is responsible for its own state, i.e., 
the data it stores, it is not generally true that any server 
can service any request. Simply adding and removing 
servers is not sufficient to respond to changes in 
workload; we additionally need to copy and move data 
between servers. However, data movement impacts 
performance, and this impact is especially noticeable in the tail 
of the latency distribution. Impacting the tail of the 
distribution is of particular interest since we target upper 
percentile SLOs. As demonstrated in Figure 2, copying data 
increases the fraction of slow requests. In Dynamo [20], 
the data copy operations are run in low priority mode to 
minimize their impact on performance of interactive op- 
erations. Since one of our operational goals is to respond 
to spikes while minimizing SLO violations, our approach 
instead identifies and copies the smallest amount of data 
needed to relieve SLO pressure. 


3 Design Techniques and Approach 


Having outlined our goals and identified key challenges 
in Section 2, we now describe the design techniques in 
our solution. In particular, we use model-predictive 
control, fine-grained workload statistics, and replication 
for performance predictability. 


3.1 Model-Predictive Control 


Model-predictive control (MPC) can yield improvements 
over classical closed-loop control systems in the pres- 
ence of noisy signals because the controller takes as in- 
put a different signal than the one it is trying to control. 
In MPC, the controller uses a model of the system and 
its current state to compute the (near) optimal sequence 
of actions that maintain desired constraints. To simplify 
the computation of these actions, MPC considers a short 
receding time horizon. The controller executes only the 
first action in the sequence and then uses the new cur- 
rent state to compute a new sequence of actions. In each 
iteration, the controller reevaluates the system state and 
computes a new target state to adjust to changing condi- 
tions. 
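
The receding-horizon idea can be illustrated with a deliberately small Python sketch. It is not the controller of Section 5: it assumes a perfect workload forecast, a trivial capacity model (a fixed number of requests per second per server), and a limit on how many servers can be added or removed per step. All names are hypothetical.

```python
def servers_needed(load, capacity):
    """Model: fewest servers whose combined capacity covers the load."""
    return max(1, -(-load // capacity))  # integer ceiling


def plan(current, forecast, capacity, max_step):
    """Plan one add/remove action per step of the forecast horizon."""
    need = [servers_needed(w, capacity) for w in forecast]
    # Backward pass: start scaling early enough to reach later peaks.
    for t in range(len(need) - 2, -1, -1):
        need[t] = max(need[t], need[t + 1] - max_step)
    actions, servers = [], current
    for target in need:
        step = max(-max_step, min(max_step, target - servers))
        actions.append(step)
        servers += step
    return actions


def run_mpc(workload, capacity, horizon, max_step, initial=1):
    """Each iteration replans over the horizon and executes only the first action."""
    servers, history = initial, []
    for t in range(len(workload)):
        forecast = workload[t:t + horizon]
        servers += plan(servers, forecast, capacity, max_step)[0]
        history.append(servers)
    return history


workload = [100, 100, 100, 400, 100, 100]  # requests/s per time step
history = run_mpc(workload, capacity=100, horizon=3, max_step=1)
print(history)
```

Note that the input to the controller is the workload forecast, not the measured latency. In this toy run the controller begins adding servers two steps before the spike, which a controller reacting only to the current step could not do under the same rate limit.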

Realizing the improvements of MPC requires con- 
structing an accurate model of the controlled system, 
which can be difficult in general. However, a distributed 
system with simple requests (see Section 2.1) is simpler 
to control: by avoiding per-server SLO violations, the 
controller avoids global violations. 

We use a model of the system that predicts SLO viola- 
tions based on the workload from individual servers. An 
overloaded server is in danger of a violation and needs to 
have data moved away. Similarly, the control framework 
uses the model to estimate how much spare capacity is 
left on an underloaded server, helpful for deciding which 
data should be moved there. Details of our model are in 
Section 5.4. 


3.2 Reduce Data Movement 


Figure 2 demonstrates that data movement negatively im- 
pacts performance. To reduce the amount of data copied 
between servers, we organize data as small units (bins), 
monitor workload to these bins, and move individual bins 
of data. This approach is commonly used to ease load- 
balancing [14, 17]. 

Figure 3: 99th percentile of latency over time measured 
during two experiments with steady workload. We kept 
the workload volume and number of servers the same, 
but changed the replication level from one data copy 
(top) to two (bottom). Horizontal lines representing the 
latencies 50 ms and 100 ms are provided for reference. 

Monitoring workload statistics at a granularity finer 
than per-server is essential for the control framework to 
decide which data should be moved or copied. Without 
this information, it would be impossible to determine 
the minimal amount of data that could be moved 
from an overloaded server to bring it back to an 
SLO-compliant state. The performance model can predict how 
much “extra room” underloaded servers have, allowing 
the control framework to choose where to move the data. 
A “best-fit” policy that keeps the servers as fully utilized 
as possible is also important for scaling down leased 
resources, as unused servers can be released. Monitoring 
workload on small ranges of data gives the control 
framework fine-grained information to move as little data as 
necessary to alleviate performance issues and to safely 
coalesce servers so they can be released. 
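
As a rough illustration of how per-bin statistics support these decisions, the Python sketch below picks bins to move off an overloaded server and a best-fit destination for each. It assumes equally sized bins (so moving the hottest bins first minimizes the number of bins moved) and a single scalar capacity per server. The function names are hypothetical; the controller's actual policy is described in Section 5.

```python
def bins_to_move(bin_load, capacity):
    """Pick the fewest bins whose removal brings the server under capacity.

    bin_load maps bin id -> request rate. Removing the hottest bins first
    sheds the most load per bin moved.
    """
    total = sum(bin_load.values())
    moved = []
    for bin_id in sorted(bin_load, key=bin_load.get, reverse=True):
        if total <= capacity:
            break
        moved.append(bin_id)
        total -= bin_load[bin_id]
    return moved


def best_fit(load, spare):
    """Pick the server with the least spare capacity that still fits load.

    spare maps server id -> predicted spare capacity. Returns None when no
    existing server fits, i.e. a new server would be needed.
    """
    candidates = [(room, server) for server, room in spare.items() if room >= load]
    return min(candidates)[1] if candidates else None


hot_server = {"a": 50, "b": 30, "c": 40}  # per-bin request rates
moved = bins_to_move(hot_server, capacity=80)
target = best_fit(hot_server[moved[0]], {"s1": 60, "s2": 55, "s3": 40})
print(moved, target)
```

Best fit packs load onto the fullest servers that can still absorb it, which leaves lightly loaded servers free to be emptied and released when the workload drops.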


3.3 Replication for Predictability 


Distributed systems, particularly those operating in a 
cloud environment, typically experience environmental 
noise uncorrelated to a particular query or data [19]. In 
our benchmarks, we saw fluctuations in 99th percentile 
of latency over time and between different servers. 

However, distributed systems also present the oppor- 
tunity to use replication as a means of improving per- 
formance. In Dynamo, setting the read/write quorum 
parameters to be less than the total number of replicas 
achieves better request latency [20]. Another example 
is in the Google File System [21], which writes logs to 
different servers. 

We handle performance perturbations caused by envi- 
ronmental noise by exploiting data replication; replica- 
tion in the cloud environment is useful for performance 
predictability. Each request is sent to multiple replicas of 
the requested item and the first response is sent back to 
the client; this is the technique described in [20]. 
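
The effect of taking the first of several responses can be seen in a short simulation. The Python sketch below makes strong simplifying assumptions that do not hold exactly in the experiments: replica latencies are independent, log-normally distributed with arbitrary parameters, and unaffected by the extra load that duplicate requests create.

```python
import random


def p99(values):
    """99th percentile (nearest-rank style) of a list of numbers."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]


def simulate(replicas, requests=50_000, seed=1):
    """Client-side latency when each request goes to `replicas` servers
    and the first response wins."""
    rng = random.Random(seed)
    return [
        min(rng.lognormvariate(2.0, 0.8) for _ in range(replicas))
        for _ in range(requests)
    ]


for replicas in (1, 2, 3):
    print(replicas, round(p99(simulate(replicas)), 1))
```

Under these assumptions a request is slow only if every replica is slow at the same time, so the tail of the client-observed latency shrinks as replicas are added.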

Figure 3 compares using one replica versus two on the 
same number of total servers (ten); shown is the 99th 
percentile of latency over time measured with steady 
workload. Note that the latency using replication is 
both smaller and more stable, even though each of these 
servers is doing more work than a server in the single 
replica scenario. It may seem that using single replicas 
with higher utilization would yield higher overall goodput 
(i.e., the amount of useful work accomplished per 
unit time). However, the extra work done by increasing 
the utilization will be in vain if those requests violate the 
SLO. In other words, the stringent SLO lowers the useful 
utilization of a single server. 

Figure 4: CDFs of 99th percentile latency measured every 
20 seconds in three experiments. Each experiment 
yields the same goodput, however using more replicas 
results in lower and less variable latency. 

Using more replicas yields lower variance in the 99th 
percentile. Figure 4 shows three Cumulative Distribu- 
tion Functions (CDFs) of the 99th percentile of latency 
during three experiments using up to three replicas; each 
experiment yields the same goodput (workload to fully 
load five single replicas). Note the shorter tails on the 
distributions as the replication factor increases. 

An advantage of using replication for performance is 
that it helps mask the effects of data movement during 
dynamic scaling. Thus replication is beneficial for alle- 
viating both naturally-occurring and introduced noise. 

Note that this data replication technique improves the 
99th percentile latency from the perspective of the client, 
but does not reduce variance of the upper percentiles of 
latency of requests from an individual server. Therefore, 
the need for model-based control due to the difficulty in 
controlling a noisy signal remains present. 


4 Related Work 


Previous projects have addressed various subsets of our 
problem space, but to our knowledge none tackle the en- 
tire problem of the online control of the upper percentiles 
of latency in stateful, distributed systems. 

Some work [2, 33] aims to optimize the static provi- 
sioning of a storage system before deploying to produc- 
tion. They search the configuration space for a cluster 
configuration that optimizes a specified utility function, 
but this optimization is done offline and performance is 
not considered during the re-configuration. 

Other work tackles online configuration changes in 
storage systems, but only considers mean request latency 
rather than the upper percentile SLOs we consider. In 
[16, 32], the authors propose a database replication policy 
for automatic scale up and down. In [32], they use 
a reactive feedback controller that monitors request 
latency and adds additional full replicas of the database. 
An enhancement in [16] uses a performance model to 
add replicas via a proactive controller. These papers 
additionally differ from our work in their assumption that 
the full dataset fits on a single server; thus they only 
consider adding a full replica when scaling up (instead of 
also partitioning). 

In [25], the controller adds and removes nodes from a 
distributed file system, rebalancing data as servers come 
and go. However, this work focuses more on controlling 
the rebalance speed rather than choosing which data to 
move to which servers; the work additionally does not 
focus on upper-percentile SLOs. 

Some systems target large-scale storage servers with 
terabytes of data on each machine and thus cannot han- 
dle a sustained workload spike or data hotspot because 
the data layout cannot change on-the-fly. For example, 
in Everest [28], the authors propose a write off-loading 
technique that allows them to absorb short bursts of writes 
to a large-scale storage system. Performance improvement 
is measured as the 99th percentile of latency during the 
30-minute experiments; however, they do not attempt to 
maintain a stringent SLO over short time intervals. Sierra 
[35] and Rabbit [1] are power-proportional systems that 
alter power consumption based on workload. The ap- 
proach that both papers take is to first provision the sys- 
tem for the peak load with multiple replicas of all data 
and then turn off servers when the workload decreases. 
Both papers evaluate the performance of the system un- 
der the power-proportional controller (Sierra uses the 
99th percentile of latency), but these systems could not 
respond to workload spikes taller than the provisioned 
capacity or to unexpected hotspots that affect individual 
servers. SMART [38] is evaluated on a large file system 
that prevents it from quickly responding to unexpected 
spikes and does not consider upper percentiles of latency. 

Most DHTs [8] are designed to withstand churn in 
the server population without affecting the availability 
and durability of the data. However, quickly adapting 
to changes in user workload and maintaining a stringent 
performance SLO during such changes are not design 
goals. Amazon’s Dynamo [20] is an example of a DHT 
that provides an SLO on the 99.9th percentile of latency, 
but the authors mention that during a busy holiday season 
it took almost a day to copy data to a new server due 
to running the copy action slowly enough to avoid 
performance issues; this low-priority copying would be slow to 
respond to unexpected spikes. 

Much has been published on dynamic resource allocation 
for stateless systems such as Web servers or 
application servers [15, 36, 26, 23, 22, 34], even considering 
stringent performance SLOs. However, most of that 
work does not directly apply to stateful storage systems: 
the control policies for stateless systems need only vary 
the number of active servers because any server can 
handle any request. These policies do not have to consider 
the complexities of data movement. 

Aqueduct [27] is a migration engine that moves data in 
a storage system while guaranteeing a performance SLO 
on mean request latency. It does not directly respond to 
workload, but could be used instead of the action sched- 
uler in our control framework (see Section 5.6). 


5 The Control Framework 


This section describes the design and implementation of 
the control framework, incorporating the strategies out- 
lined in Section 3. The framework uses per-server work- 
load and the performance model to determine when a 
server is overloaded and thus when to copy data. It 
chooses what to copy based on workload statistics on 
small units of data (bins). Finer statistics together with 
the models inform where to copy data. 


5.1 The control loop 


The control framework consists of a controller, workload 
forecaster, and action scheduler which, together with the 
storage system and performance models, form a control 
loop (see Figure 5). These components are described in 
more detail in subsequent sections. 

We focus on the controller, which is responsible for 
altering the configuration of the cluster by prescribing 
actions that add/remove servers and move/copy data be- 
tween servers. Its decisions are based on a view of the 
current state given by the workload forecaster and the 
current data layout, in consultation with models that pre- 
dict how servers will perform under particular loads. Af- 
ter the controller compiles a list of actions to run on the 
cluster, the action scheduler executes them. 

Workload statistics are maintained for small ranges of
data called bins; each bin is about 10-100 MB of data.
These bins also represent the unit of data movement. We 
assume a bin cannot be further partitioned and will need 
to be replicated if its workload exceeds the capacity of 
a single server. The total number of bins is a parame- 
ter of the control framework. Setting the value too low 
or too high has its drawbacks. With too few data bins, 
the controller does not have enough flexibility in terms 
of moving data from overloaded servers and might have 
to copy more data than necessary. Having too many data 
bins increases the load on the monitoring system and run- 
ning the controller might take longer since it would have 
to consider more options. In practice, having on average 
five to ten bins per server is a good compromise. 


5.2 A manipulable storage system 


The SCADS [6] storage system provides an interface for 
dynamic scaling: it is easy to control which servers have 
which data, and data can be manipulated as small bins. 
SCADS is an eventually consistent key-value store with 
range partitioning. Each node can serve multiple small 
ranges; e.g., keys A-C, G-I. We use the get and put 
operators; read requests are satisfied from one or more 
servers, and writes are asynchronously propagated and 
flushed to all replicas. 

SCADS provides an interface for copying and mov- 
ing data between pairs of servers; replication is ac- 
complished by copying the target data range to another 
server, and partitioning is the result of moving data from 
one server to another. The SCADS design makes low la- 
tency a top priority, thus all data is kept in memory. This 
characteristic has little impact on the control framework, 
besides simplifying the performance modeling described 
in Section 5.4. 


5.3 Controller 


Given the workload statistics in each bin, the minimal 
number of servers would be achieved by solving a bin- 
packing problem—packing the data bins into servers— 
an NP-complete problem. While approximate algorithms 
exist [37], they typically do not consider the current loca- 
tions of the bins and thus could completely reshuffle the 
data on the servers, a costly operation. Instead, our con- 
troller uses a greedy heuristic that moves data from the 
overloaded servers and coalesces underloaded servers. 
While there are many possible controller implementa- 
tions, we describe our design that leverages the solutions 
outlined above. 

The controller executes periodically to decide how to 
alter the configuration of the cluster; the frequency is an 
implementation parameter. In each iteration, the controller
prescribes actions for overloaded and underloaded
servers and changes the number of servers. By the
end of an iteration, the controller has compiled a list of
actions to be run on the cluster, which are then executed
by the action scheduler (see Section 5.6).

Algorithm 1 Controller iteration
 1: estimate workload on each server
 2: identify servers that are overloaded or underloaded
 3:
 4: for all overloaded server S do
 5:   while S is overloaded do
 6:     determine hottest bin H on S
 7:     if workload on H is too high for a single server then
 8:       move and replicate H to empty servers
 9:     else
10:       move H to the most-loaded underloaded server
          that can accept H without SLO violation
11:
12: for all underloaded server S do
13:   if S contains only a single bin replica then
14:     remove the bin if no longer necessary
15:   else
16:     for all bin B on S do
17:       move B to most-loaded underloaded server that
          can accept B
18:       if cannot move B then
19:         leave it on S
20:
21: add/remove servers as necessary, as per previous actions

Pseudocode for the controller is shown in Algorithm 1. 
Using a performance model (described in the next sub- 
section), the controller predicts which servers are under- 
loaded or overloaded. Lines 4-10 describe the steps for 
fixing an overloaded server: moving bins that have too 
much workload for one server to dedicated servers, or 
moving bins to the most loaded servers that have enough 
capacity, a “best-fit” approach. Next, in lines 12-19, in an 
attempt to deallocate servers for scaling down, the controller
moves bins from the least loaded servers
to other underloaded servers. Finally, servers are added 
and removed from the cluster. To simplify its reason- 
ing about the current state of the system, the controller 
waits until previously scheduled copy actions complete. 
Long-running actions could block the controller from ex- 
ecuting, preventing it from responding to sudden changes 
in workload. To avoid scheduling such actions,
the controller uses a copy-duration model to estimate
action duration and splits potentially long-running
actions into shorter ones. For example, an action that 
needs to move many bins from one server to another can 
be split into several actions that move fewer bins between 
the two servers. If some of the actions do not complete 
within a time threshold, the controller can cancel them 
to reassess the current state and continue to respond to 
workload changes. 
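
The greedy pass of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the SCADS implementation: the `Server` class, the `capacity` and `low_water` thresholds (standing in for the performance model), and the action tuples are all hypothetical, and replication of very hot bins and removal of single-replica bins are reduced to a placeholder action.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    bins: dict = field(default_factory=dict)  # bin id -> request rate

    def load(self):
        return sum(self.bins.values())


def best_fit(servers, rate, exclude, capacity, low_water, allow_empty):
    """Most-loaded underloaded server that can accept a bin with this rate."""
    candidates = [
        s for s in servers
        if s is not exclude
        and s.load() < low_water
        and s.load() + rate <= capacity
        and (allow_empty or s.bins)
    ]
    return max(candidates, key=Server.load, default=None)


def plan_iteration(servers, capacity, low_water):
    """Return (action, bin, source, target) tuples for one controller pass."""
    actions = []

    # Lines 4-10: relieve overloaded servers, hottest bin first.
    for s in servers:
        while s.load() > capacity:
            hot = max(s.bins, key=s.bins.get)
            rate = s.bins[hot]
            target = None
            if rate <= capacity:
                target = best_fit(servers, rate, s, capacity, low_water, True)
            if target is None:
                # Bin too hot for one server, or nowhere to put it: the real
                # controller would replicate it or request new servers.
                actions.append(("needs_new_servers", hot, s.name, None))
                break
            target.bins[hot] = s.bins.pop(hot)
            actions.append(("move", hot, s.name, target.name))

    # Lines 12-19: try to drain underloaded servers, least loaded first.
    received = set()
    for s in sorted(servers, key=Server.load):
        if not s.bins or s.load() >= low_water or s.name in received:
            continue
        for b in sorted(s.bins, key=s.bins.get, reverse=True):
            target = best_fit(servers, s.bins[b], s, capacity, low_water, False)
            if target is None:
                continue  # line 19: leave the bin where it is
            target.bins[b] = s.bins.pop(b)
            received.add(target.name)
            actions.append(("move", b, s.name, target.name))
    return actions
```

Servers left with no bins after the pass are candidates for removal (line 21); tracking `received` is an assumption of this sketch that keeps two underloaded servers from trading bins back and forth within one iteration.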

The controller can also maintain a user-specified number
of standby servers, a form of extra capacity in addition
to overprovisioning in the workload smoothing component
(see Section 5.5). These standbys help the controller
avoid waiting for new servers to boot up during a
sudden workload spike, as they are already running the 
storage system software but not serving any data. Stand- 
bys are particularly useful for handling hotspots when 
replicas of a bin require an empty server. 

The presence of a centralized component such as the 
controller does not necessarily mean the system isn’t 
scalable [19]. Nevertheless, there is likely a limit to the
number of decisions the controller can make per unit 
time for a given number of servers and/or bins. In our 
results, the controller inspects forty servers in a few 
seconds; experimenting with a larger cluster is future 
work. If a decision-making limit is approached, the con- 
troller may need to make decisions less frequently; this 
could impact the attainable SLO if the workload changes 
rapidly. However, with more servers, the controller has 
more flexibility in placing data, meaning it doesn’t have 
to consider many servers when relocating a particular 
bin. 


5.4 Benchmarking and modeling 


The controller uses models of system performance to de- 
termine which servers are overloaded/underloaded and 
to guide its decisions as to which data to move where, 
as well as how many servers to add or remove. Re- 
call that Model-Predictive Control requires an accurate 
model of the system. Instead of responding to changes 
in the 99th percentile of request latency, our controller responds
directly to changes in system workload. Therefore,
the controller needs a model that accurately predicts
whether a server can handle a particular workload with- 
out violating the performance SLO. Our controller also 
uses a model of duration of the data copy operations to 
create short copy actions. 

One of the standard approaches to performance modeling
is to use analytical models based on networks of
queues. These models require detailed understanding 
of the system and often make strong assumptions about 
the request arrival and service time distributions. Conse- 
quently, analytical models are difficult to construct and 
their predictions might not match the performance of the 
system in production environments. 

Instead, we use statistical machine learning (SML) 
models. As noted in the solutions above, a model-based 
approach allows us to use a signal other than latency in 
the control loop. Consequently, the controller needs an 
accurate model of the system on which to base its de- 
cisions. Building a model typically involves gathering 
training data by introducing a range of inputs into the 
system and observing the outcomes. In a large-scale system
it becomes more difficult to construct the appropriate
set of inputs [7]. Furthermore, it is more likely in a
larger system to only be able to observe a subset of the
component interactions that actually take place. Not having
knowledge of all interactions (unmodeled dynamics)
leads to a less accurate model.

Figure 6: The training data and steady-state model for
two replicas. The x- and y-axes represent the request
rates of get and put operations, and the small dots and
large squares represent workloads that the server can and
cannot handle, respectively. The solid line crossing the
four others is the boundary of the performance model.
SCADS can handle workload rates to the left of this line.

Fortunately, we can leverage the simple communica- 
tion pattern of SCADS requests to simplify the model- 
ing process. Other key-value stores with similar sim- 
ple requests would also be amenable to modeling. Be- 
low we describe the development and use of two mod- 
els, the steady-state model and the copy-duration model. 
All benchmarks were run on tens of SCADS servers and 
workload-generating clients on Amazon’s Elastic Com- 
pute Cloud (EC2) on m1.small instances. 

Simple changes in workload, such as a shift in pop- 
ularity of individual objects [11, 5, 9], will not affect 
the accuracy of these offline models as all SCADS re- 
quests are served from memory. The performance of 
these offline models (and thus the system) may degrade 
over time if new, unmodeled features are added to the 
application. For example, an individual request may be- 
come more expensive if it returns more data or if new 
types of requests are supported. The model’s degradation
speed would be application-specific; however, these
feature-change events are known to the developer and the 
offline models can be periodically rebuilt via benchmark- 
ing and fine-tuned in production [12]. 

Steady-state model: The steady-state performance 
model is used to predict whether a server can handle 
a particular workload without violating a given latency 
threshold. The controller uses this model to detect which 
servers are overloaded and to decide where data should 
be moved. To build this model, we benchmark SCADS 
under steady workload for a variety of workload mixes: 
read/write ratios 50/50, 80/20, 90/10 and 95/5 (these
mixes are also used in [18]). We then create a linear classification
model using logistic regression, based on training
data from the benchmarks. The model has two covariates
(features): the workload rate of get and put requests.
For each workload mix, we determine the workload
volume at which the latency threshold specified by
the SLO would be surpassed. This workload volume sep- 
arates two classes: SLO violation or no violation. Thus, 
for a particular workload, the model can predict whether 
a server with that workload would violate the SLO. Fig- 
ure 6 illustrates the steady-state linear model and the 
training data used to generate it. 
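
To illustrate how such a linear classifier is consulted, the sketch below evaluates a logistic-regression decision function over the two covariates. The coefficient values are hypothetical placeholders chosen for the example; they are not the fitted values from the benchmarks.

```python
import math

# Hypothetical coefficients, NOT the values fitted in the paper.
INTERCEPT = -7.0
W_GET = 0.001  # weight per get request/sec
W_PUT = 0.004  # weight per put request/sec


def violation_probability(get_rate, put_rate):
    """Logistic model: estimated probability that the SLO is violated."""
    z = INTERCEPT + W_GET * get_rate + W_PUT * put_rate
    return 1.0 / (1.0 + math.exp(-z))


def can_handle(get_rate, put_rate, cutoff=0.5):
    """True if the model predicts no SLO violation for this workload."""
    return violation_probability(get_rate, put_rate) < cutoff
```

With a cutoff of 0.5 the decision boundary is the straight line where the weighted sum of the two rates equals the negated intercept, which is the kind of boundary Figure 6 depicts.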

Copy-duration model: To allow the controller to es- 
timate how long it will take to copy data between two 
servers, we build a model that predicts the rate of data 
transfer during a copy action. While the copy opera- 
tion in SCADS has a parameter for specifying the num- 
ber of bytes/second at which to transfer data, the actual 
rate is often lower because of activity on both servers in- 
volved. Our model thus predicts the copy-rate factor— 
the ratio of observed to specified copy-rate. A factor of 
0.8 means that the actual copy rate is only 80% of
the specified rate. We use this estimate of the actual rate
to compute the duration of the copy action. 

To build the model, we benchmark the duration of copy
actions between pairs of servers operating at various 
workload rates. We then model the copy rate factor using 
linear regression; covariates are linear and quadratic in 
the specified rate and get and put request rates. 
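
A sketch of how the copy-rate factor translates into a duration estimate is given below. The feature layout (an intercept, the three covariates, and their squares, with no cross terms), the clamping bounds, and the coefficient vector are assumptions made for illustration only.

```python
def copy_rate_factor(specified_rate, get_rate, put_rate, coef):
    """Regression with linear and quadratic terms in the three covariates.

    coef has seven entries: intercept, three linear, three quadratic.
    """
    x = [specified_rate, get_rate, put_rate]
    features = [1.0] + x + [v * v for v in x]
    factor = sum(c * f for c, f in zip(coef, features))
    # Clamp to a sane range (an assumption) so the duration stays finite.
    return min(1.0, max(0.05, factor))


def copy_duration_seconds(num_bytes, specified_rate, get_rate, put_rate, coef):
    """Estimated duration = bytes / (factor * specified bytes per second)."""
    factor = copy_rate_factor(specified_rate, get_rate, put_rate, coef)
    return num_bytes / (factor * specified_rate)
```

For example, with a factor of 0.8 and a specified rate of 4 MB/s, copying 40 MB is estimated to take 12.5 seconds rather than 10.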


While our controller does not directly consider the 
effects of data copy on system performance during 
real-time decisions, we considered these effects when 
designing the controller and the action execution 
modules. Recall that Figure 2 summarizes the results 
of benchmarking SCADS during copy operations; 
performance is affected mostly on the target servers for 
the copy action. Also note that in both performance 
models network utilization and activity of other VMs 
are ignored. These effects are part of environmental 
noise described earlier, and are compensated for with 
replication. 


5.5 Workload Monitoring and Smoothing 


In addition to performance models, the controller needs 
to know how workload is distributed amongst the data. 
Workload is represented by a histogram that contains re- 
quest rates for individual request types (get and put) 
for each bin. To minimize the impact of monitoring on 
performance, we sample 2% of get requests for use in 
our statistics (put requests are sampled at 40% because
there are fewer put requests in our workload mixes). We 
found that using higher sampling rates did not greatly improve
accuracy.

Every twenty seconds, a summary of the workload
volume is generated for each bin. This creates the raw 
workload histogram: for each bin we have counts of the 
number of get and put requests to keys in that bin. To 
prevent the controller from reacting to small variance in 
workload, the raw workload is smoothed via hysteresis. 
As scaling up is more important than scaling down with 
respect to performance, we want to respond quickly to 
workload spikes while coalescing servers more slowly. 
We apply smoothing with two parameters: $\alpha_{up}$ and
$\alpha_{down}$. If the workload in a bin increases relative to
the last time step’s smoothed workload, we smooth that
bin’s workload with $\alpha_{up}$; otherwise we use the $\alpha_{down}$
smoothing parameter. For example, in the case of increasing
workload at time $t$ we have:
$\mathrm{smoothed}_t = \mathrm{smoothed}_{t-1} + \alpha_{up} \cdot (\mathrm{raw}_t - \mathrm{smoothed}_{t-1})$.

The smoothed workload can also be amplified using 
an overprovisioning factor. Overprovisioning causes the 
controller to think the workload on a server is higher than 
it actually is. For instance, an overprovisioning factor 
of 0.1 would make an actual workload of w appear to 
the controller as 1.1w. Thus overprovisioning creates
a “safety buffer” that buys the controller more time to
move data. For more discussion of tradeoffs, see Sec- 
tion 7. 
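
The smoothing and overprovisioning steps can be sketched in a few lines. The default parameter values mirror those listed for the experiments; applying the same update form with the down parameter for decreasing workload is an assumption, since the text spells out only the increasing case.

```python
def smooth(prev_smoothed, raw, alpha_up=0.9, alpha_down=0.1):
    """Hysteresis smoothing: follow increases quickly, decreases slowly."""
    alpha = alpha_up if raw > prev_smoothed else alpha_down
    return prev_smoothed + alpha * (raw - prev_smoothed)


def overprovision(smoothed, factor=0.1):
    """Inflate the workload seen by the controller by the given factor."""
    return smoothed * (1.0 + factor)
```

Starting from a smoothed value of 100, a raw reading of 200 moves the estimate to 190 in one step, whereas a drop from 200 to 100 moves it only to 190, so scale-down is triggered more cautiously than scale-up.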

The controller bases its decisions on an estimate of 
the workload at each server, determined by sampling the 
requests. Calculating per-bin workload in a centralized 
controller may prove unscalable as the number of re- 
quests to sample grows large. While we used a single 
server to process the requests and compute the per-bin 
workloads, the Chukwa monitoring system [29] could 
be distributed over a cluster of servers. The monitoring 
system could then prioritize the delivery of the monitor- 
ing data to the controller, sending updates only for bins 
with significant changes in workload. Another approach 
would have each server maintain workload information 
over a specified time interval. The controller could then 
query for the workload information when it begins its 
decision-making process. 
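
As a rough illustration of how sampled request counts become per-bin and per-server workload estimates, consider the sketch below. The sampling fractions and the twenty-second interval come from this section; the rescaling by the inverse sampling fraction, the `layout` mapping, and the omission of load sharing between replicas are assumptions of the example.

```python
SAMPLE_FRACTION = {"get": 0.02, "put": 0.40}


def estimate_bin_rates(sampled_counts, interval_seconds=20.0):
    """sampled_counts maps (bin_id, op) to the number of sampled requests.

    Returns estimated requests per second for each (bin_id, op).
    """
    rates = {}
    for (bin_id, op), count in sampled_counts.items():
        rates[(bin_id, op)] = count / SAMPLE_FRACTION[op] / interval_seconds
    return rates


def server_workload(bin_rates, layout):
    """layout maps a server name to the bin ids it serves (hypothetical)."""
    totals = {}
    for server, bins in layout.items():
        totals[server] = {
            op: sum(bin_rates.get((b, op), 0.0) for b in bins)
            for op in ("get", "put")
        }
    return totals
```

Keeping the per-bin estimates separate from the per-server totals matches the division of labor described above: the monitoring side can compute the former, and the controller only needs to combine them with the current data layout.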


5.6 Action Scheduler 


On most storage systems, copying data between servers 
has a negative impact on performance of the interactive 
workload. In SCADS, the copy operation significantly 
affects the target server (see Figure 2), while the source 
server is mostly unaffected. Therefore, executing all 
data copy actions concurrently might overwhelm the sys- 
tem and reduce performance. Executing the actions se- 
quentially would minimize the performance impact, but 
would be very slow. 

In addition to improving steady-state performance of 
storage systems, replication helps smooth performance 
during data copy. We specify a constraint that each bin
have at least one replica on a server that is not affected
by data copy. The action scheduler iterates through its 
list of actions and schedules concurrently all actions that 
do not violate the constraint. When an action completes, 
the scheduler repeats this process with the remaining un- 
scheduled actions. 


5.7 Controller Parameters 


A summary of the parameters used by the controller appears
in Table 5.7, along with the values used in our experiments
(in Section 6). The hysteresis parameters $\alpha_{up}$
and $\alpha_{down}$ affect how abruptly the controller will scale
up and down. Reasonable values for these parameters 
can be chosen via simulation [13]. 








Controller Parameter            | Value
execution period                | 20 seconds
$\alpha_{up}$, $\alpha_{down}$  | 0.9, 0.1
number of standbys              | 2
overprovisioning                | 0.1 or 0.3
copy rate                       | 4 MB/s














6 Experimental Results 


We evaluate our control framework implementation by 
stress testing it with two workload profiles that represent 
the main scenarios where our proposed control frame- 
work could be applied. The first workload contains a 
spike on a single data item; as shown in [11], web ap- 
plications typically experience hotspots on a small frac- 
tion of the data. Unexpected workload spikes with data 
hotspots are difficult to handle in stateful systems be- 
cause the location of the hotspot is unknown before the 
spike. Therefore, statically overprovisioning for such 
spikes would be expensive. Managing and monitor- 
ing small data ranges is especially important for dealing 
with these hotspots, particularly when quick replication 
is needed. The second workload exhibits a diurnal work- 
load pattern: workload volume increases during the day 
and decreases at night; this profile demonstrates the ef- 
fectiveness of both scale-up and scale-down. 

For the hotspot workload, we observe how well the 
control framework is able to react to a sudden increase in 
workload volume, as well as how quickly performance 
stabilizes. We also look at the performance impact dur- 
ing this transition period. Note, however, that any sys- 
tem will likely have some visible impact for sufficiently 
strict characteristics of the spike (i.e., how rapidly it ar- 
rives and how much extra workload there is). The di- 
urnal workload additionally exercises the control frame- 
work’s ability to both scale up and down. Finally, we dis- 
cuss some of the tradeoffs of SLO parameters and cost of 
leased resources, as well as potential savings to be gained 
by scaling up and down. 


6.1 Experiment setup 


Experiments were run using Amazon’s Elastic Compute 
Cloud (EC2). We ran SCADS servers on m1.small in- 
stances using 800 MB of the available RAM as each 
storage server’s in-memory cache. We gained an understanding
of the variance present in this environment
by benchmarking SCADS’ performance both in the absence
and presence of data movement (see Section 5.4).
As described in Section 2, latency variance occurs in 
the upper quantiles even in the absence of data move- 
ment. Therefore we maintain at least two copies of each 
data item, using the replication strategy described earlier: 
each get request is sent to both replicas and we count 
the faster response as its latency. We do not consider the 
latency of put requests, as the work described in this 
paper is targeted towards OLTP-type applications similar 
to those described by [18], in which read requests domi- 
nate writes. Furthermore, evaluating latency for write re- 
quests isn’t applicable in an eventually consistent system, 
such as SCADS. More appropriate would be an SLO on 
data staleness, a subject for future work. 

Workload is generated by a separate set of machines, 
also m1.small instances on EC2. These experiments use 
sixty workload-generating instances and twenty server 
instances. The control framework runs on one m1.xlarge 
instance. The controller uses a 100 ms SLO threshold on 
latency for get requests, and in the description of each 
experiment we discuss the other two parameters of the 
SLO: the percentile at which to evaluate the threshold, 
and the interval over which to assess violations. Table 1 
summarizes the parameter values used in the two experi- 
ments. To avoid running an experiment for an entire day, 
we execute it in a shorter time. We control the length of 
the boot-up time in the experiment by leasing all the vir- 
tual machines needed before the experiment begins and 
simply adding a delay before a “new” server can be used. 
This technique allows us to replay the Ebates.com workload
trace [10] 12x faster: replaying twenty-four hours
of the trace in two hours. To retain the proportionality
of the other time-related parameters, we scale down by
12x the data size, server cost interval, boot up time, and
server release time. The data size is scaled down because
we can’t speed up the copy rate higher than the network
bandwidth on m1.small instances allows. Additionally,
the total data size is limited by the maximum storage
available on the servers that remain when the cluster is
scaled down. As SCADS keeps its data in memory, server
capacity is limited by available memory on the m1.small instance.

Parameter               | Hotspot      | Diurnal
server boot-up time     | 3 minutes    | 15 seconds
server charge interval  | 60 minutes   | 5 minutes
server capacity         | 800 MB       | 66.7 MB
size of 1 key-value     | 256 B        | 256 B
total number of keys    | 4.8 million  | 400,000
minimum # of replicas   | 2            | 2
total data size         | 2.2 GB       | 196 MB
read/write ratio        | 95/5         | 95/5

Table 1: Various experiment parameters for the hotspot
and diurnal workload experiments. We replay the diurnal
workload with a speed-up factor of 12 and thus also reduce
the server boot-up and charge intervals and the data
size by a factor of 12.

Figure 7: Workload over time in the Hotspot experiment.
Top row: aggregate request rate during the spike doubled
between 5:12 and 5:17. Bottom row: request rate
for each of the 200 data bins; the rate for the hot bin increased
to approximately 30,000 reqs/sec.


6.2 Hotspot 


We create a synthetic spike workload based on the 
statistics of a spike experienced by CNN.com after the
September 11 attacks [24]. The workload increased by 
an order of magnitude in 15 minutes, which corresponds 
to about a 100% increase in 5 minutes. We simulate this
workload by using a flat, one-hour long period of the 
Ebates.com trace [10] to which we add a workload spike 
with a single hotspot. During a five minute period, the 
aggregate workload volume increases linearly by a fac- 
tor of two, but all the additional workload is directed at a 
single key in the system. Figure 7 depicts the aggregate 
workload and the per-bin workload over time. Notice 
that when the spike occurs, the workload in the hot bin 
greatly exceeds that in all other bins. 
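
The shape of this synthetic spike can be written down directly. The sketch below is illustrative only: the function names and the assumption that the workload stays at the elevated level after the ramp are ours, and the base rate is left as a parameter.

```python
def compound_growth(total_factor, total_minutes, window_minutes):
    """Growth factor over a shorter window, assuming exponential growth."""
    return total_factor ** (window_minutes / total_minutes)


def spike_rates(t_seconds, base_rate, spike_start, ramp_seconds=300.0,
                factor=2.0):
    """Return (aggregate rate, extra rate on the hot key) at time t.

    The aggregate rises linearly from base_rate to factor * base_rate over
    ramp_seconds; all of the extra load is directed at a single key.
    """
    progress = min(1.0, max(0.0, (t_seconds - spike_start) / ramp_seconds))
    extra = base_rate * (factor - 1.0) * progress
    return base_rate + extra, extra
```

Under exponential growth, a tenfold increase in 15 minutes corresponds to a factor of roughly 2.15 per 5 minutes, which is the basis for treating the spike as an approximate doubling over the five-minute ramp.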

Our controller dynamically creates eight additional 
replicas of this hot data bin to handle the spike. Figure 8 
shows the performance (99th percentile latency) and the 
number of servers over time. The workload spike im- 
pacts performance for a brief period. However, the con- 
troller quickly begins replicating the hot data bin. It first 
uses the two standbys, then requests additional servers. 
Performance stabilizes in less than three minutes. 

It is relatively easy for our control framework to re- 
act to spikes like this because only a very small fraction 
of the data has to be replicated. We can thus handle a 
spike with data hotspots with resources proportional to
the magnitude of the spike, not proportional to the size
of the full dataset or the number of servers.

Figure 8: Performance and resources in the Hotspot experiment.
Top row: 99th percentile of latency along with
the 100 ms threshold (dashed line). Bottom row: number
of servers over time. The controller keeps up with
the spike for the first few minutes, then latency increases
above the threshold, but the system quickly recovers.

Interval    | Max percentile
5 minutes   | 98
1 minute    | 95
20 seconds  | 80

Table 2: The maximum percentile without SLO violations
for each interval in the Hotspot experiment. Notice
that we can support higher latency percentiles for
longer time intervals.

The performance impact when the spike first arrives 
is brief, but may result in an SLO violation, depending 
how the SLO is specified. The SLO is parameterized by 
the latency threshold, latency percentile, and duration of 
the SLO interval. Fixing the latency threshold at 100 ms, 
in Table 2 we show how varying the interval affects the
maximum percentile under which no violations occurred. 

In general, SLOs specified over a longer time interval 
are easier to maintain despite drastic workload changes; 
this experiment has one five-minute violation. Similarly, 
an SLO with a lower percentile will have fewer violations 
than a higher one. In this experiment, there are zero violations
over a twenty-second window when looking at
the 80th percentile of latency, but extending the interval
to five minutes can yield the 98th percentile.
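
The interaction between the three SLO parameters can be made concrete with a small violation counter. This is a sketch under stated assumptions: latency samples are bucketed into fixed, non-overlapping intervals, and the percentile uses the nearest-rank definition; the evaluation in the paper may differ in both respects.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def count_violations(samples, threshold_ms, pct, interval_s):
    """samples: iterable of (timestamp_seconds, latency_ms) pairs.

    An interval is in violation when its pct-th percentile latency
    exceeds threshold_ms.
    """
    buckets = {}
    for ts, latency in samples:
        buckets.setdefault(int(ts // interval_s), []).append(latency)
    return sum(1 for lat in buckets.values()
               if percentile(lat, pct) > threshold_ms)
```

A single slow request in a short interval can push that interval's upper percentiles over the threshold, while the same request is diluted in a longer interval; this is why longer intervals tolerate higher percentiles.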

The cost tradeoff between SLO violations and leased 
resources depends in part on the cost of a violation. 
Whether a violation costs more than leasing enough 
servers to overprovision the system to satisfy a hotspot 
on any data item will be application-specific. Dynamic 
scaling, however, has the advantage of not having to estimate
the magnitude of unexpected spikes.

Figure 9: Top: Diurnal workload pattern. Middle: number
of servers assuming the ideal server allocation and
two fixed allocations during the diurnal workload experiment.
Bottom: ideal server allocation and two elastic
allocations using our control framework.


6.3 Ebates.com diurnal workload trace


The diurnal workload profile is derived from a trace from 
Ebates.com [10]. We use the trace’s aggregate workload
pattern, while data accesses follow a constant Zipfian distribution.
This profile shows the control framework’s effectiveness
in scaling both up and down as the workload
volume on all data items fluctuates. We replay twenty- 
four hours of the trace in two hours, a 12x speedup. 

We experiment using two overprovisioning parameters 
(see Section 5.5 on workload smoothing). With 0.3 over- 
provisioning, the smoothed workload is multiplied by a 
factor of 1.3. With more headroom, the system can better 
absorb small spikes in the workload. Using 0.1 overprovisioning
leaves less headroom, yielding higher savings at the
cost of worse performance.

We compare the results of our experiments with the 
ideal resource allocation and two fixed allocation calcu- 
lations. In the ideal allocation, we assume that we know 
the workload at each time step throughout the experiment 
and compute the minimum number of servers we would 
need to support this workload for each 5-minute interval 
(the scaled-down server cost interval). The ideal allocation
assumes that moving data is instantaneous and has
no effect on performance, and provides the lower bound 
on the number of compute resources required to handle 
this workload without SLO violations. 

The fixed-100% and fixed-70% allocations use a con- 
stant number of servers throughout the experiment. 
Fixed-100% assumes the workload’s peak value is 
known a priori, and computes the number of servers 
based on that value and the maximum throughput of 
each server (7,000 requests per second, see Section 5.4).
The number of servers used in the fixed-100% alloca- 
tion equals the maximum number of servers used by the 
ideal allocation. Fixed-70% is calculated similarly to the 
fixed-100%, but restricts the servers’ utilization to 70% 
of their potential throughput (i.e., 7,000 * 0.7 = 4,900 
requests per second). Fixed-100% is the ideal fixed al- 
location, but in practice datacenter operators often add 
more headroom to absorb unexpected spikes. 
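
The baseline allocations reduce to simple ceiling divisions, sketched below. The per-server throughput is the figure quoted above; the function names and the workload numbers in the usage note are hypothetical, and the sketch ignores replication and memory capacity.

```python
import math

MAX_THROUGHPUT = 7000  # requests per second per server (from the text)


def fixed_allocation(peak_rate, utilization=1.0):
    """Servers needed for the peak rate at the given target utilization."""
    return math.ceil(peak_rate / (MAX_THROUGHPUT * utilization))


def ideal_server_units(peak_rate_per_interval):
    """Sum over charge intervals of the minimum servers for each interval."""
    return sum(fixed_allocation(rate) for rate in peak_rate_per_interval)
```

For a hypothetical peak of 70,000 requests per second, fixed-100% needs 10 servers and fixed-70% needs 15, since each server is then budgeted at 4,900 requests per second.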

Figure 9 shows the workload profile and the number 
of server units used by the different allocation policies: 
ideal, fixed-100%, fixed-70%, and our elastic policy with
overprovisioning of 0.3 and 0.1. A server unit corre- 
sponds to one server being used for one charge interval, 
thus fewer server units used translates to monetary cost 
savings. The policy with 0.1 overprovisioning achieves 
savings of 16% and 41% compared to the fixed-100% 
and fixed-70% allocations, respectively. 

The ideal resource allocation uses 175 server units,
while using overprovisioning of 0.1 uses 241 server 
units. However, recall that our controller maintains 
two empty standby servers to quickly respond to data 
hotspots that require replication. The actual number of 
server units used for serving data is thus 191 which is 
within 10% of the ideal allocation.³
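
The arithmetic behind this comparison, using the two standbys and the 25 charging intervals noted in footnote 3, works out as follows:

```latex
\begin{align*}
\text{standby units} &= 2 \times 25 = 50 \\
\text{serving units} &= 241 - 50 = 191 \\
\frac{191}{175} &\approx 1.09
\end{align*}
```

That is, the servers that actually hold data consume about 9% more server units than the ideal allocation.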

Performance and SLO violations are summarized in 
Figure 10. Note that it is more difficult to maintain SLOs 
with shorter time intervals and higher percentiles. 


7 Discussion 


The experiments demonstrate the control framework’s 
effectiveness in scaling both up and down for typical 
workload profiles that exhibit fluctuating workload pat- 
terns. Having the same mechanism work well in sce- 
narios with rapidly appearing hotspots as well as more 
gradual variations is advantageous because application 
developers won’t need to decide a priori what type of 
growth to prepare for: the same control framework can 
dynamically scale as needed in either case. For operators 
who still prefer to maintain a fixed allocation for non- 
spike, peak traffic, say on their own hardware, there is 
still potential to utilize the control framework for surge 
computing in the cloud. A temporary spike could be satisfied
with leased resources from the cloud, which would
be relinquished once the spike subsides.

3 The experiment has a total of 25 5-minute server-charging intervals,
which yields 50 server units used by the standbys.

Figure 10: Top: Number of SLO violations during the
0.1 overprovisioning diurnal experiment, for different
values of the SLO percentile. The three lines in the graph
correspond to the three intervals over which to evaluate
the SLO: 5 minutes, 1 minute, and 20 seconds. Bottom:
summary of SLO violations and maximum latency percentile
supported with no SLO violations during the diurnal
workload with two different overprovisioning parameters.

There are cost implications in setting some of the con- 
trol framework’s parameters to manage “extra capacity,” 
namely the number of standbys and the overprovision- 
ing factor. Both these techniques result in higher server 
costs, either due to maintaining booted empty servers for 
standbys or underutilization of active servers in the case 
of overprovisioning. Standby servers are particularly 
helpful for dealing with workload spikes which neces- 
sitate replication, as empty servers are waiting and ready 
to receive data. Overprovisioning is better for workload 
profiles like a diurnal pattern in which all data items more 
slowly experience increased access rates; this headroom 
allows the control framework more time to shuffle data 
around without overloading the servers. Reducing the 
number of standbys and/or the overprovisioning factor 
can yield cost savings, with the associated risk of SLO 
violations if scaling up is not performed rapidly enough. 

We presented results of our controller using replica- 
tion to both smooth variance and lessen the effects of data 
movement. To see that the controller remains robust to 
the variance in the environment without replication, we 
performed the same two experiments using only a single 
copy of each data item. While SCADS still scales effectively,


FAST ’11: 9th USENIX Conference on File and Storage Technologies


the variance limits the attainable SLO percentile.
For example, in the hotspot workload, the 5-minute, 1- 
minute, and 20-second attainable percentiles were 95, 
80, and 80, respectively, compared to 98, 95, and 80 
when using replication. The replication factor thus of- 
fers a tradeoff between performance/robustness and the 
cost of running the system. Note, however, that a differ- 
ent environment than EC2, like dedicated hardware, may 
have less variance and thus may achieve the desired SLO 
without replication. 

The ability to control these performance tradeoffs is an 
advantage of running the SCADS key-value store on EC2 
rather than simply using S3 for data storage. In general,
S3 is optimized for larger files and has nontrivial overhead
per HTTP request. S3 also does not offer an SLO
on latency, while SCADS offers a developer-specified
SLO. Data replication factor and data location are not
tunable with S3, which would make maintaining a particular
SLO difficult. More fundamentally, S3 does not
provide the API that SCADS on EC2 does. SCADS supports
features like TestAndSet() and various methods
on ranges of keys; this enables a higher-level query
language on top. Additionally, the SCADS client library
supports read/write quorums for trading off performance
and consistency; this would also be meaningless without
being able to control the replication factor.
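To illustrate why quorums depend on the replication factor, the sketch below applies the textbook overlap rule: with N replicas, a read quorum of R and a write quorum of W are guaranteed to share a replica only when R + W > N. This is the general rule, not a description of the SCADS client library; the function name and the sample settings are hypothetical.

```python
def quorums_overlap(n_replicas: int, read_quorum: int, write_quorum: int) -> bool:
    """Return True if every read quorum shares a replica with every write quorum."""
    return read_quorum + write_quorum > n_replicas


# Hypothetical settings: (N, R, W).
for n, r, w in [(1, 1, 1), (3, 1, 1), (3, 2, 2), (3, 1, 3)]:
    outcome = "reads see the latest write" if quorums_overlap(n, r, w) else "reads may be stale"
    print(f"N={n} R={r} W={w}: {outcome}")
```

With a single copy (N = 1) there is nothing to trade off; the choice between fast, possibly stale reads (R = 1, W = 1) and overlapping quorums (R = 2, W = 2) only appears once the replication factor can be raised.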


8 Future Work 


Future work includes incorporating resource heterogene- 
ity in the control framework, as well as designing a 
framework simulator for performing what-if analysis. 
Cloud providers typically offer a variety of resources at 
different costs, e.g., paying more per hour for a server
with more CPU or disk capacity. By modeling perfor- 
mance of different server types, we could include in the 
control framework decisions about which type of server 
to use. Additionally, we hope to use the performance 
models in a control framework simulator that emulates 
the behavior of real servers. The simulator could be used 
for assessing the performance-cost tradeoff for unseen 
workloads; developers could create synthetic workloads 
using the features described in [11]. 


9 Conclusion 


The elasticity of the cloud provides an opportunity for 
dynamic resource allocation, scaling up when workload 
increases and scaling down to save money. To date,
this opportunity has been exploited primarily by stateless 
services, in which simply adding and removing servers 
is sufficient to track workload variation. Our goal was 
to design a control framework that could automatically 
scale a stateful key-value store in the cloud while com- 
plying with a stringent performance SLO in which a very 
high percentile of requests (typically 99%) must meet a
specific latency bound. As described in Section 2, meeting
such a stringent SLO is challenging both because of
high variance in the tail of the request latency distribu- 
tion and because of the need to copy data in addition to 
adding and removing servers. Our solution avoids trying 
to control for such a noisy latency signal, instead using 
a model-based approach that maps workload to latency. 
This model, combined with fine-grained workload statistics,
allows the framework to move only the data necessary
to alleviate performance issues while limiting the amount
of leased resources to what is needed to satisfy the current
workload. In the event of an unexpected hotspot, replicas are
added proportional to the magnitude of the spike, not the
total number of servers. For a workload that exhibits a diurnal
pattern, the framework easily scales both up and
down as the workload fluctuates. In the midst of this dynamic
scaling, we use replication to mask both inherent
environmental noise and the performance perturbations 
introduced by data movement. We anticipate that this 
work provides a useful starting point for allowing large- 
scale storage systems to take advantage of the elasticity 
of cloud computing. 


10 Acknowledgements 


We would like to thank our fellow students and faculty in 
the RAD lab for their ongoing thoughtful advice through- 
out this project. We additionally thank Kim Keeton, Tim 
Kraska, Ari Rabkin, Eno Thereska, and John Wilkes for 
their feedback on this paper. 

This research is supported in part by a National Sci- 
ence Foundation graduate fellowship, and gifts from Sun 
Microsystems, Google, Microsoft, Amazon Web Ser- 
vices, Cisco Systems, Cloudera, eBay, Facebook, Fu- 
jitsu, Hewlett-Packard, Intel, Network Appliance, SAP, 
VMWare and Yahoo! and by matching funds from the 
State of California’s MICRO program (grants 06-152, 
07-010, 06-148, 07-012, 06-146, 07-009, 06-147, 07- 
013, 06-149, 06-150, and 07-008), the National Science 
Foundation (grant #CNS-0509559), and the University 
of California Industry/University Cooperative Research 
Program (UC Discovery) grant COM07-10240.


References 


[1] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. 
Kozuch, and K. Schwan. Robust and Flexible 
Power-Proportional Storage. In SoCC: ACM Sym- 
posium on Cloud Computing, 2010. 


[2] E. Anderson, M. Hobbs, K. Keeton, S. Spence, 
M. Uysal, and A. Veitch. Hippodrome: Run- 
ning Circles Around Storage Administration. In 
FAST: Conference on File and Storage Technolo- 
gies, 2002. 


[3] Apache. Cassandra. incubator.apache.org/cassandra, 2010.


[4] Apache. HBase. hadoop.apache.org/hbase, 2010.


[5] M. Arlitt and T. Jin. Workload Characterization of 
the 1998 World Cup Web Site. Technical Report 
HPL-1999-35R1, HP Labs, 1999. 


[6] M. Armbrust, A. Fox, D. Patterson, N. Lanham, 
H. Oh, B. Trushkowsky, and J. Trutna. SCADS: 
Scale-independent Storage for Social Computing 
Applications. In Conference on Innovative Data 
Systems Research (CIDR), 2009. 


[7] K. J. Åström. Introduction to Stochastic Control
Theory. Academic Press, 1970. 


[8] H. Balakrishnan, M. F. Kaashoek, D. Karger, 
R. Morris, and I. Stoica. Looking Up Data in 
P2P Systems. Communications of the ACM, 46(2), 
February 2003. 


[9] P. Barford and M. Crovella. Generating Repre- 
sentative Web Workloads for Network and Server 
Performance Evaluation. In Proceedings of the 
ACM SIGMETRICS Joint International Confer- 
ence, 1998. 


[10] P. Bodik et al. Combining Visualization and Statistical
Analysis to Improve Operator Confidence and
Efficiency for Failure Detection and Localization.
In International Conference on Autonomic Computing
(ICAC), 2005.


[11] P. Bodik, A. Fox, M. J. Franklin, M. I. Jordan,
and D. Patterson. Characterizing, Modeling, and
Generating Workload Spikes for Stateful Services.
In SoCC: ACM Symposium on Cloud Computing,
2010.


[12] P. Bodik, R. Griffith, C. Sutton, A. Fox, M. I. Jordan,
and D. A. Patterson. Automatic Exploration
of Datacenter Performance Regimes. In Proceedings
of the 1st Workshop on Automated Control for
Datacenters and Clouds, 2009.


[13] P. Bodik, R. Griffith, C. Sutton, A. Fox, M. I.
Jordan, and D. A. Patterson. Statistical Machine
Learning Makes Automatic Control Practical for
Internet Datacenters. In Workshop on Hot Topics
in Cloud Computing (HotCloud), 2009.


[14] F. Chang et al. Bigtable: A Distributed Storage
System for Structured Data. In OSDI: Symposium
on Operating Systems Design and Implementation,
2006.




[15] J. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat,
and R. P. Doyle. Managing Energy and Server
Resources in Hosting Centers. In Symposium on
Operating Systems Principles (SOSP), 2001.


[16] J. Chen, G. Soundararajan, and C. Amza. Autonomic
Provisioning of Backend Databases in Dynamic
Content Web Servers. In International Conference
on Autonomic Computing (ICAC), 2006.


[17] B. F. Cooper, R. Ramakrishnan, U. Srivastava,
A. Silberstein, P. Bohannon, H.-A. Jacobsen,
N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!’s
Hosted Data Serving Platform. In Proceedings
of the International Conference on Very Large
Databases (VLDB), 2008.


[18] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan,
and R. Sears. Benchmarking cloud serving
systems with YCSB. In SoCC: ACM Symposium
on Cloud Computing, 2010.


[19] J. Dean. Evolution and Future Directions of Large-scale
Storage and Computation Systems at Google.
In SoCC: ACM Symposium on Cloud Computing,
2010.


[20] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati,
A. Lakshman, A. Pilchin, S. Sivasubramanian,
P. Vosshall, and W. Vogels. Dynamo: Amazon’s
Highly Available Key-value Store. In Symposium
on Operating Systems Principles (SOSP), 2007.


[21] S. Ghemawat, H. Gobioff, and S.-T. Leung. The
Google File System. In Symposium on Operating
Systems Principles (SOSP), 2003.


[22] J. L. Hellerstein, V. Morrison, and E. Eilebrecht.
Optimizing Concurrency Levels in the
.NET ThreadPool: A Case Study of Controller Design
and Implementation. In Workshop on Feedback
Control Implementation and Design in Computing
Systems and Networks, 2008.


[23] D. Kusic et al. Power and Performance Management
of Virtualized Computing Environments Via
Lookahead Control. In International Conference
on Autonomic Computing (ICAC), 2008.


[24] W. LeFebvre. CNN.com: Facing a world crisis.
www.tcsa.org/lisa2001/cnn.txt, 2001.


[25] H. C. Lim, S. Babu, and J. S. Chase. Automated
Control for Elastic Storage. In International Conference
on Autonomic Computing (ICAC), 2010.


[26] X. Liu, J. Heo, L. Sha, and X. Zhu. Adaptive
Control of Multi-Tiered Web Applications Using
Queueing Predictor. Network Operations and Management
Symposium, April 2006.


[27] C. Lu, G. A. Alvarez, and J. Wilkes. Aqueduct: Online
Data Migration with Performance Guarantees.
In FAST: Conference on File and Storage Technologies,
2002.


[28] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety,
and A. I. T. Rowstron. Everest: Scaling
Down Peak Loads Through I/O Off-Loading. In
OSDI: Symposium on Operating Systems Design
and Implementation, 2008.


[29] A. Rabkin and R. H. Katz. Chukwa: A System
for Reliable Large-scale Log Collection. Master’s
thesis, EECS Department, University of California,
Berkeley, Mar 2010.


[30] J. A. Rossiter. Model Based Predictive Control: A
Practical Approach. CRC Press, 2003.


[31] J. Rothschild. High Performance at Massive
Scale - Lessons Learned at Facebook.
http://cns.ucsd.edu/lecturearchive09.shtml#Roth,
October 2009.


[32] G. Soundararajan, C. Amza, and A. Goel. Database
Replication Policies for Dynamic Content Applications.
In EuroSys: ACM SIGOPS/EuroSys European
Conference on Computer Systems, 2006.


[33] J. D. Strunk, E. Thereska, C. Faloutsos, and G. R.
Ganger. Using Utility to Provision Storage Systems.
In FAST: Conference on File and Storage
Technologies, 2008.


[34] G. Tesauro, N. Jong, R. Das, and M. Bennani. A
Hybrid Reinforcement Learning Approach to Autonomic
Resource Allocation. In International Conference
on Autonomic Computing (ICAC), 2006.


[35] E. Thereska, A. Donnelly, and D. Narayanan.
Sierra: A Power-Proportional, Distributed storage
System. Technical Report MSR-TR-2009-153, Microsoft,
2009.


[36] B. Urgaonkar, P. Shenoy, A. Chandra, and P. Goyal.
Dynamic Provisioning of Multi-tier Internet Applications.
In International Conference on Autonomic
Computing (ICAC), 2005.


[37] V. V. Vazirani. Approximation Algorithms.
Springer, 2003.


[38] L. Yin, S. Uttamchandani, M. Korupolu, K. Voruganti,
and R. Katz. SMART: An Integrated Multi-Action
Advisor for Storage Systems. In USENIX
Annual Technical Conference, 2006.


USENIX Association 




Scale and Concurrency of GIGA+: 
File System Directories with Millions of Files 


Swapnil Patil and Garth Gibson 
Carnegie Mellon University 


{firstname.lastname @ cs.cmu.edu} 


Abstract — We examine the problem of scalable file system 
directories, motivated by data-intensive applications requiring 
millions to billions of small files to be ingested in a single di- 
rectory at rates of hundreds of thousands of file creates every 
second. We introduce a POSIX-compliant scalable directory 
design, GIGA+, that distributes directory entries over a cluster 
of server nodes. For scalability, each server makes only local, in- 
dependent decisions about migration for load balancing. GIGA+ 
uses two internal implementation tenets, asynchrony and even- 
tual consistency, to: (1) partition an index among all servers 
without synchronization or serialization, and (2) gracefully tol- 
erate stale index state at the clients. Applications, however, are 
provided traditional strong synchronous consistency semantics. 
We have built and demonstrated that the GIGA+ approach scales 
better than existing distributed directory implementations, deliv- 
ers a sustained throughput of more than 98,000 file creates per 
second on a 32-server cluster, and balances load more efficiently 
than consistent hashing. 


1 Introduction 


Modern file systems deliver scalable performance for large 
files, but not for large numbers of files [18, 67]. In par- 
ticular, they lack scalable support for ingesting millions 
to billions of small files in a single directory - a growing 
use case for data-intensive applications [18, 44, 50]. We 
present a file system directory service, GIGA+, that uses 
highly concurrent and decentralized hash-based indexing, 
and that scales to store at least millions of files in a sin- 
gle POSIX-compliant directory and sustain hundreds of 
thousands of file creates per second.


The key feature of the GIGA+ approach is to enable 
higher concurrency for index mutations (particularly cre- 
ates) by eliminating system-wide serialization and syn- 
chronization. GIGA+ realizes this principle by aggres- 
sively distributing large, mutating directories over a clus- 
ter of server nodes, by disabling directory entry caching 
in clients, and by allowing each node to migrate, without 
notification or synchronization, portions of the directory 
for load balancing. Like traditional hash-based distributed 
indices [17, 36, 52], GIGA+ incrementally hashes a direc- 
tory into a growing number of partitions. However, GIGA+ 
tries harder to eliminate synchronization and prohibits migration
if load balancing is unlikely to be improved.


Clients do not cache directory entries; they cache only 
the directory index. This cached index can have stale point- 
ers to servers that no longer manage specific ranges in the 
space of the hashed directory entries (filenames). Clients 
using stale index values to target an incorrect server have 
their cached index corrected by the incorrectly targeted 
server. Stale client indices are aggressively improved by 
transmitting the history of splits of all partitions known 
to a server. Even the addition of new servers is supported 
with minimal migration of directory entries and delayed 
notification to clients. In addition, because 99.99% of the 
directories have less than 8,000 entries [4, 14], GIGA+ 
represents small directories in one partition so most direc- 
tories will be essentially like traditional directories. 


Since modern cluster file systems have support for data 
striping and failure recovery, our goal is not to compete 
with all features of these systems, but to offer additional
technology to support high rates of mutation of many
small files.¹ We have built a skeleton cluster file system
with GIGA+ directories that layers on existing lower layer 
file systems using FUSE [19]. Unlike the current trend of 
using special purpose storage systems with custom inter- 
faces and semantics [6, 20, 54], GIGA+ directories use the 
traditional UNIX VFS interface and provide POSIX-like 
semantics to support unmodified applications. 


Our evaluation demonstrates that GIGA+ directories 
scale linearly on a cluster of 32 servers and deliver a 
throughput of more than 98,000 file creates per second 
— outscaling the Ceph file system [63] and the HBase 
distributed key-value store [26], and exceeding peta- 
scale scalability requirements [44]. GIGA+ indexing also 
achieves effective load balancing with one to two orders 
of magnitude less re-partitioning than if it was based on 
consistent hashing [30, 58]. 


In the rest of the paper, we present the motivating use 
cases and related work in Section 2, the GIGA+ indexing 
design and implementation in Sections 3-4, the evaluation 
results in Section 5, and conclusion in Section 6. 


¹ OrangeFS is currently integrating a GIGA+ based distributed directory
implementation into a system based on PVFS [2, 45].


2 Motivation and Background


Over the last two decades, research in large file systems 
was driven by application workloads that emphasized ac- 
cess to very large files. Most cluster file systems provide 
scalable file I/O bandwidth by enabling parallel access 
using techniques such as data striping [20, 21, 25], object- 
based architectures [21, 39, 63, 66] and distributed locking 
[52, 60, 63]. A few file systems scale metadata performance
by using a coarse-grained distribution of metadata over 
multiple servers [16, 46, 52, 63]. But most file systems 
cannot scale access to a large number of files, much less ef- 
ficiently support concurrent creation of millions to billions 
of files in a single directory. This section summarizes the 
technology trends calling for scalable directories and how 
current file systems are ill-suited to satisfy this call. 


2.1 Motivation


In today’s supercomputers, the most important I/O workload
is checkpoint-restart, where many parallel applications
running on, for instance, ORNL’s Cray XT5 cluster
(with 18,688 nodes of twelve processors each) periodically
write application state into a file per process, all stored in
one directory [7, 61]. Applications that do this per-process
checkpointing are sensitive to long file creation delays because
of the generally slow file creation rate, especially
in one directory, in today’s file systems [7]. Today’s requirement
for 40,000 file creates per second in a single
directory [44] will become much bigger in the impending
Exascale-era, when applications may run on clusters with
up to billions of CPU cores [31].

Supercomputing checkpoint-restart, although important,
might not be a sufficient reason for overhauling the current
file system directory implementations. Yet there are
diverse applications, such as gene sequencing, image processing
[62], phone logs for accounting and billing, and
photo storage [6], that essentially want to store an unbounded
number of files that are logically part of one
directory. Although these applications are often using
the file system as a fast, lightweight “key-value store”,
replacing the underlying file system with a database is
an oft-rejected option because it is undesirable to port
existing code to use a new API (like SQL) and because traditional
databases do not provide the scalability of cluster
file systems running on thousands of nodes [3, 5, 53, 59].


Authors of applications seeking lightweight stores for 
lots of small data can either rewrite applications to avoid 
large directories or rely on underlying file systems to im- 
prove support for large directories. Numerous applica- 
tions, including browsers and web caches, use the for- 
mer approach where the application manages a large 
logical directory by creating many small, intermediate 
sub-directories with files hashed into one of these sub-directories.
This paper chose the latter approach because
users prefer this solution. Separating large directory management
from applications has two advantages. First,
developers do not need to re-implement large directory 
management for every application (and can avoid writing 
and debugging complex code). Second, an application- 
agnostic large directory subsystem can make more in- 
formed decisions about dynamic aspects of a large direc- 
tory implementation, such as load-adaptive partitioning 
and growth rate specific migration scheduling. 

Unfortunately most file system directories do not cur- 
rently provide the desired scalability: popular local file 
systems are still being designed to handle little more than 
tens of thousands of files in each directory [43, 57, 68] 
and even distributed file systems that run on the largest 
clusters, including HDFS [54], GoogleFS [20], PanFS 
[66] and PVFS [46], are limited by the speed of the single 
metadata server that manages an entire directory. In fact, 
because GoogleFS scaled up to only about 50 million files, 
the next version, ColossusFS, will use BigTable [12] to 
provide a distributed file system metadata service [18]. 

Although there are file systems that distribute the direc- 
tory tree over different servers, such as Farsite [16] and 
PVFS [46], to our knowledge, only three file systems now 
(or soon will) distribute single large directories: IBM’s 
GPFS [52], Oracle’s Lustre [38], and UCSC’s Ceph [63].


2.2 Related work


GIGA+ has been influenced by the scalability and concur- 
rency limitations of several distributed indices and their 
implementations. 


GPFS: GPFS is a shared-disk file system that uses a
distributed implementation of Fagin’s extendible hashing
for its directories [17, 52]. Fagin’s extendible hashing
dynamically doubles the size of the hash table, pointing
pairs of links to the original bucket and expanding only
the overflowing bucket (by restricting implementations to
a specific family of hash functions) [17]. It has a two-level
hierarchy: buckets (to store the directory entries) and a 
table of pointers (to the buckets). GPFS represents each 
bucket as a disk block and the pointer table as the block 
pointers in the directory’s i-node. When the directory 
grows in size, GPFS allocates new blocks, moves some of 
the directory entries from the overflowing block into the 
new block and updates the block pointers in the i-node. 
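The two-level structure described above can be sketched as a toy in-memory table. This is a minimal illustration of extendible hashing in general, not GPFS code: the bucket capacity, the use of Python's built-in `hash`, and indexing by low-order bits are all assumptions made for the example. In GPFS the buckets would be disk blocks and the pointer table would be the block pointers in the directory's i-node.

```python
class ExtendibleHash:
    """Toy extendible hash table: a pointer table over shared buckets."""

    def __init__(self, capacity=2):
        self.capacity = capacity    # assumed per-bucket entry limit
        self.global_depth = 0       # pointer table has 2**global_depth slots
        self.table = [[0, {}]]      # each bucket is [local_depth, entries]

    def _slot(self, key):
        # Index the pointer table with the low-order bits of the hash.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.table[self._slot(key)][1].get(key)

    def put(self, key, value):
        while True:
            bucket = self.table[self._slot(key)]
            if key in bucket[1] or len(bucket[1]) < self.capacity:
                bucket[1][key] = value
                return
            self._split(bucket)     # overflow: split, then retry

    def _split(self, bucket):
        if bucket[0] == self.global_depth:
            # Double the pointer table; slots i and i + old_size
            # now form a pair of links to the same bucket.
            self.table = self.table + self.table
            self.global_depth += 1
        bit = 1 << bucket[0]
        bucket[0] += 1
        sibling = [bucket[0], {}]
        # Move entries whose next hash bit is set into the new bucket.
        for k in list(bucket[1]):
            if hash(k) & bit:
                sibling[1][k] = bucket[1].pop(k)
        # Re-point only the slots that belong to the new bucket.
        for i, b in enumerate(self.table):
            if b is bucket and i & bit:
                self.table[i] = sibling


table = ExtendibleHash(capacity=2)
for n in range(100):
    table.put(n, n * n)
print(table.global_depth, len(table.table), table.get(7))
```

Only the overflowing bucket is split; the pointer table doubles only when that bucket is already as deep as the table itself.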


GPFS employs its client cache consistency and dis- 
tributed locking mechanism to enable concurrent access to 
a shared directory [52]. Concurrent readers can cache the 
directory blocks using shared reader locks, which enables 
high performance for read-intensive workloads. Concur- 
rent writers, however, need to acquire write locks from the 
lock manager before updating the directory blocks stored 
on the shared disk storage. When releasing (or acquir- 
ing) locks, GPFS versions before 3.2.1 force the directory 
block to be flushed to disk (or read back from disk), inducing
high I/O overhead. Newer releases of GPFS have modified
the cache consistency protocol to send the directory
insert requests directly to the current lock holder, instead
of getting the block through the shared disk subsystem
[1, 22, 27]. Still, GPFS continues to synchronously write
the directory’s i-node (i.e., the mapping state), invalidating
client caches to provide strong consistency guarantees [1].
In contrast, GIGA+ allows the mapping state to be stale 
at the client and never be shared between servers, thus 
seeking even more scalability. 

Lustre and Ceph: Lustre’s proposed clustered metadata 
service splits a directory using a hash of the directory en- 
tries only once over all available metadata servers when it 
exceeds a threshold size [37, 38]. The effectiveness of this 
"split once and for all" scheme depends on the eventual 
directory size and does not respond to dynamic increases 
in the number of servers. Ceph is another object-based 
cluster file system that uses dynamic sub-tree partitioning 
of the namespace and hashes individual directories when 
they get too big or experience too many accesses [63, 64]. 
Compared to Lustre and Ceph, GIGA+ splits a directory 
incrementally as a function of size, i.e., a small directory 
may be distributed over fewer servers than a larger one. 
Furthermore, GIGA+ facilitates dynamic server addition 
achieving balanced server load with minimal migration. 

Linear hashing and LH*: Linear hashing grows a hash
table by splitting its hash buckets in a linear order using a 
pointer to the next bucket to split [34]. Its distributed vari- 
ant, called LH* [35], stores buckets on multiple servers 
and uses a central split coordinator that advances permis- 
sion to split a partition to the next server. An attractive 
property of LH* is that it does not update a client’s map- 
ping state synchronously after every new split. 
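As a point of reference for the comparison that follows, here is a minimal single-node sketch of linear hashing. It is not LH* (there are no servers or coordinator), and the bucket capacity, the use of Python's `hash`, and the split-on-overflow trigger are assumptions made for the example. Note that the bucket that gets split is the one at the split pointer, which is not necessarily the one that overflowed.

```python
class LinearHash:
    """Toy linear hash table that splits buckets in linear order."""

    def __init__(self, capacity=4):
        self.capacity = capacity   # assumed overflow threshold per bucket
        self.level = 0             # the round began with 2**level buckets
        self.next_split = 0        # pointer to the next bucket to split
        self.buckets = [[]]

    def _addr(self, key):
        h = hash(key)
        addr = h % (1 << self.level)
        if addr < self.next_split:
            # This bucket was already split in the current round.
            addr = h % (1 << (self.level + 1))
        return addr

    def insert(self, key):
        bucket = self.buckets[self._addr(key)]
        bucket.append(key)
        if len(bucket) > self.capacity:
            self._split()

    def contains(self, key):
        return key in self.buckets[self._addr(key)]

    def _split(self):
        victim = self.next_split
        entries = self.buckets[victim]
        self.buckets[victim] = []
        self.buckets.append([])    # new bucket at index 2**level + victim
        self.next_split += 1
        if self.next_split == 1 << self.level:
            # Every bucket of this round has been split: start a new round.
            self.level += 1
            self.next_split = 0
        for key in entries:
            self.buckets[self._addr(key)].append(key)


index = LinearHash(capacity=4)
for n in range(200):
    index.insert(n)
print(index.level, index.next_split, len(index.buckets))
```

Because there is exactly one split pointer, splits are inherently serialized; this is the property that LH* preserves with its central coordinator.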


GIGA+ differs from LH* in several ways. To main- 
tain consistency of the split pointer (at the coordinator), 
LH* splits only one bucket at a time [35, 36]; GIGA+ 
allows any server to split a bucket at any time without any 
coordination. LH* offers a complex partition pre-split op- 
timization for higher concurrency [36], but it causes LH* 
clients to continuously incur some addressing errors even 
after the index stops growing; GIGA+ chose to minimize 
(and stop) addressing errors at the cost of more client state. 

Consistent hashing: Consistent hashing divides the hash- 
space into randomly sized ranges distributed over server 
nodes [30, 58]. Consistent hashing is efficient at managing 
membership changes because server changes split or join 
hash-ranges of adjacent servers only, making it popular for 
wide-area peer-to-peer storage systems that have high rates 
of membership churn [13, 42, 48, 51]. Cluster systems, 
even though they have much lower churn than Internet- 
wide systems, have also used consistent hashing for data 
partitioning [15, 32], but have faced interesting challenges. 

As observed in Amazon’s Dynamo, consistent hashing’s 
data distribution has a high load variance, even after using 
“virtual servers” to map multiple randomly sized hash- 
ranges to each node [15]. GIGA+ uses threshold-based 


binary splitting that provides better load distribution even 
for small clusters. Furthermore, consistent hashing sys- 
tems assume that every data-set needs to be distributed 
over many nodes to begin with, i.e., they do not have sup- 
port for incrementally growing data-sets that are mostly 
small — an important property of file system directories. 

Other work: DDS [24] and Boxwood [40] also used 
scalable data-structures for storage infrastructure. While 
both GIGA+ and DDS use hash tables, GIGA+’s focus is 
on directories, unlike DDS’s general cluster abstractions, 
with an emphasis on indexing that uses inconsistency at
the clients, a non-goal for DDS [24]. Boxwood proposed
primitives to simplify storage system development, and 
used B-link trees for storage layouts [40]. 


3 GIGA+ Indexing Design


3.1 Assumptions


GIGA+ is intended to be integrated into a modern cluster 
file system like PVFS, PanFS, GoogleFS, HDFS etc. All 
these scalable file systems have good fault tolerance usu- 
ally including a consensus protocol for node membership 
and global configuration [9, 29, 65]. GIGA+ is not designed
to replace membership or fault tolerance mechanisms; it avoids
relying on them where possible and employs them where needed.


The GIGA+ design is also guided by several assumptions
about its use cases. First, most file system directories 
start small and remain small; studies of large file sys- 
tems have found that 99.99% of the directories contain 
fewer than 8,000 files [4, 14]. Since only a few directories 
grow to really large sizes, GIGA+ is designed for incre- 
mental growth, that is, an empty or a small directory is 
initially stored on one server and is partitioned over an 
increasing number of servers as it grows in size. Perhaps 
most beneficially, incremental growth in GIGA+ handles 
adding servers gracefully. This allows GIGA+ to avoid 
degrading small directory performance; striping small directories
across multiple servers will lead to inefficient
resource utilization, particularly for directory scans (using
readdir()) that will incur disk-seek latency on all
servers only to read tiny partitions.


Second, because GIGA+ is targeting concurrently shared
directories with up to billions of files, caching such direc- 
tories at each client is impractical: the directories are too 
large and the rate of change too high. GIGA+ clients do 
not cache directories and send all directory operations to 
a server. Directory caching only for small rarely changing 
directories is an obvious extension employed, for example, 
by PanFS [66], that we have not yet implemented. 

Finally, our goal in this research is to complement ex- 
isting cluster file systems and support unmodified appli- 
cations. So GIGA+ directories provide the strong consis- 
tency for directory entries and files that most POSIX-like 
file systems provide, i.e., once a client creates a file in a 
directory all other clients can access the file. This strong
consistency API differentiates GIGA+ from “relaxed” consistency
provided by newer storage systems including
NoSQL systems like Cassandra [32] and Dynamo [15].


Figure 1: Concurrent and unsynchronized data partitioning in GIGA+. The hash-space (0, 1] is divided into multiple partitions
($P_i$) that are distributed over many servers (different shades of gray). Each server has a local, partial view of the entire index and can
independently split its partitions without global co-ordination. In addition to enabling highly concurrent growth, an index starts small
(on one server) and scales out incrementally.


3.2 Unsynchronized data partitioning


GIGA+ uses hash-based indexing to incrementally divide 
each directory into multiple partitions that are distributed 
over multiple servers. Each filename (contained in a direc- 
tory entry) is hashed and then mapped to a partition using 
an index. Our implementation uses the cryptographic 
MD5 hash function but is not specific to it. GIGA+ relies
only on one property of the selected hash function: for 
any distribution of unique filenames, the hash values of 
these filenames must be uniformly distributed in the hash 
space [49]. This is the core mechanism that GIGA+ uses 
for load balancing. 
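The uniformity property can be checked informally with a short sketch. It assumes, purely for illustration, that the first four bytes of the MD5 digest are used as the hash value and that the low-order bits select a bucket; the helper names `filename_hash` and `bucket_counts` and the `file.N` naming scheme are hypothetical and not taken from the GIGA+ prototype.

```python
import hashlib


def filename_hash(name: str) -> int:
    """Return a 32-bit integer taken from the MD5 digest of a filename."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def bucket_counts(names, buckets: int):
    """Count how many filenames fall into each of `buckets` hash buckets."""
    counts = [0] * buckets
    for name in names:
        # Low-order bits of the hash pick the bucket (illustrative choice).
        counts[filename_hash(name) % buckets] += 1
    return counts


# Hypothetical workload: 8,000 sequentially named files over 8 buckets.
names = [f"file.{n}" for n in range(8000)]
counts = bucket_counts(names, 8)
print(counts)
```

Even though the filenames are highly regular, each bucket should receive close to 1,000 entries, which is the behavior the load-balancing argument depends on.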


Figure 1 shows how GIGA+ indexing grows incrementally.
In this example, a directory is to be spread over three
servers {S_0, S_1, S_2}, shown in three shades of gray. P_i^{(x,y]}
denotes the hash-space range (x, y] held by a partition with
the unique identifier i.² GIGA+ uses the identifier i to map
P_i to an appropriate server S_i using a round-robin mapping,
i.e., server S_i is i modulo num_servers. The color of each
partition indicates the (color of the) server it resides on.
Initially, at time T_0, the directory is small and stored on a
single partition P_0^{(0,1]} on server S_0. As the directory grows
and the partition size exceeds a threshold number of directory
entries, provided this server knows of an underutilized
server, S_0 splits P_0^{(0,1]} into two by moving the greater half
of its hash-space range to a new partition P_1^{(0.5,1]} on S_1. As
the directory expands, servers continue to split partitions
onto more servers until all have about the same fraction
of the hash-space to manage (analyzed in Section 5.2 and
5.3). GIGA+ computes a split’s target partition identifier
using well-known radix-based techniques.³

² For simplicity, we disallow the hash value zero from being used.


The key goal for GIGA+ is for each server to split inde- 
pendently, without system-wide serialization or synchro- 
nization. Accordingly, servers make local decisions to 
split a partition. The side-effect of uncoordinated growth 
is that GIGA+ servers do not have a global view of the 
partition-to-server mapping on any one server; each server 
only has a partial view of the entire index (the mapping 
tables in Figure 1). Other than the partitions that a server 
manages, a server knows only the identity of the server 
that knows more about each “child” partition resulting 
from a prior split by this server. In Figure 1, at time T_3,
server S_1 manages partition P_1 at tree depth r = 3, and
knows that it previously split P_1 to create children partitions,
P_3 and P_5, on servers S_0 and S_2 respectively. Servers
are mostly unaware of partition splits that happen on
other servers (and did not target them); for instance, at
time T_3, server S_0 is unaware of partition P_5 and server S_1
is unaware of partition P_2.


Specifically, each server knows only the split history 
of its partitions. The full GIGA+ index is a complete 
history of the directory partitioning, which is the transitive 
closure over the local mappings on each server. This full 
index is also not maintained synchronously by any client. 
GIGA+ clients can enumerate the partitions of a directory 
by traversing its split histories starting with the zeroth 
partition P_0. However, such a full index constructed and
cached by a client may be stale at any time, particularly
for rapidly mutating directories.

³ GIGA+ calculates the identifier of partition i using the depth of the
tree, r, which is derived from the number of splits of the zeroth partition
P_0. Specifically, if a partition has an identifier i and is at tree depth r,
then in the next split P_i will move half of its filenames, from the larger
half of its hash-range, to a new partition with identifier i + 2^r. After
a split completes, both partitions will be at depth r + 1 in the tree. In
Figure 1, for example, partition P_1^{(0.5,0.75]}, with identifier i = 1, is at tree
depth r = 2. A split causes P_1 to move the larger half of its hash-space
(0.625, 0.75] to the newly created partition P_5, and both partitions are
then at tree depth of r = 3.
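The radix-based identifier rule (a partition with identifier i at depth r creates a child with identifier i + 2^r) can be replayed with a minimal sketch. The function name `split_partition` and the tuple layout are assumptions made for illustration; exact fractions are used so that range boundaries compare exactly.

```python
from fractions import Fraction


def split_partition(i, r, lo, hi):
    """Split partition i at tree depth r that holds hash range (lo, hi].

    Returns (parent, child), each as (identifier, depth, lo, hi).
    The parent keeps the smaller half; the child gets the larger half.
    """
    mid = (lo + hi) / 2
    child_id = i + 2 ** r              # identifier rule: i + 2^r
    parent = (i, r + 1, lo, mid)
    child = (child_id, r + 1, mid, hi)
    return parent, child


# Replay the splits that lead to P_5 in the example.
p0 = (0, 0, Fraction(0), Fraction(1))    # P_0 holds (0, 1] at depth 0
p0, p1 = split_partition(*p0)            # creates P_1 holding (0.5, 1]
p1, p3 = split_partition(*p1)            # creates P_3 holding (0.75, 1]
p1, p5 = split_partition(*p1)            # creates P_5 holding (0.625, 0.75]
print(p1, p5)
```

After the last call, P_1 holds (0.5, 0.625] and P_5 holds (0.625, 0.75], both at depth 3, matching the worked example in the text.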


3.3 Tolerating inconsistent mapping at clients 


Clients seeking a specific filename find the appropriate 
partition by probing servers, possibly incorrectly, based 
on their cached index. To construct this index, a client 
must have resolved the directory’s parent directory entry 
which contains a cluster-wide i-node identifying the server 
and partition for the zeroth partition P_0. Partition P_0 may
be the appropriate partition for the sought filename, or it 
may not because of a previous partition split that the client 
has not yet learned about. An “incorrectly” addressed 
server detects the addressing error by recomputing the 
partition identifier by re-hashing the filename. If this 
hashed filename does not belong in the partition it has, 
this server sends a split history update to the client. The 
client updates its cached version of the global index and 
retries the original request. 
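The probe-and-retry behavior can be illustrated with a small simulation. This is a sketch under several assumptions: the index is modeled as an integer bitmap whose bit j is set when partition P_j is known, each server's view is a hand-written bitmap for a three-server example, a round-robin mapping is used, and the function names `find_partition` and `lookup_with_updates` are hypothetical.

```python
def find_partition(h, bitmap):
    """Return the deepest known partition whose low-order bits match h."""
    # Depth bound: bit length of the largest identifier in the bitmap.
    r = (bitmap.bit_length() - 1).bit_length()
    while r > 0:
        j = h & ((1 << r) - 1)
        if (bitmap >> j) & 1:
            return j
        r -= 1
    return 0  # the zeroth partition is always known


def lookup_with_updates(h, client_bitmap, server_views, num_servers):
    """Probe servers until one accepts the request; merge split histories."""
    probes = 0
    while True:
        probes += 1
        p = find_partition(h, client_bitmap)
        view = server_views[p % num_servers]      # round-robin mapping
        if find_partition(h, view) == p:
            return p, client_bitmap, probes       # correctly addressed
        client_bitmap |= view                     # apply the update, retry


# Assumed server views for partitions {P0, P1, P2, P3, P5} on 3 servers.
server_views = {0: 0b001111, 1: 0b101011, 2: 0b100101}
first = lookup_with_updates(0b101, 0b1, server_views, 3)
second = lookup_with_updates(0b101, first[1], server_views, 3)
print(first, second)
```

Starting from an index that contains only P_0, the client needs two incorrect probes before reaching P_5; with the updated index the same request succeeds on the first probe.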

The drawback of allowing inconsistent indices is that 
clients may need additional probes before addressing re- 
quests to the correct server. The required number of in- 
correct probes depends on the client request rate and the 
directory mutation rate (rate of splitting partitions). It is 
conceivable that a client with an empty index may send 
O(log(N_p)) incorrect probes, where N_p is the number of
partitions, but GIGA+’s split history updates make this
many incorrect probes unlikely (described in Section 5.4). 
Each update sends the split histories of all partitions that 
reside on a given server, filling all gaps in the client index 
known to this server and causing client indices to catch up 
quickly. Moreover, after a directory stops splitting parti- 
tions, clients soon after will no longer incur any addressing 
errors. GIGA+’s eventual consistency for cached indices 
is different from LH*’s eventual consistency because the 
latter’s idea of independent splitting (called pre-splitting 
in their paper) suffers addressing errors even when the 
index stops mutating [36]. 


3.4 Handling server additions 


This section describes how GIGA+ adapts to the addition 
of servers in a running directory service.⁴


When new servers are added to an existing configuration, 
the system is immediately no longer load balanced, and it 
should re-balance itself by migrating a minimal number of 
directory entries from all existing servers equally. Using 
the round-robin partition-to-server mapping, shown in 
Figure 1, a naive server addition scheme would require 
re-mapping almost all directory entries whenever a new 
server is added. 

GIGA+ avoids re-mapping all directory entries on addition
of servers by differentiating the partition-to-server
mapping for initial directory growth from the mapping for
additional servers. For additional servers, GIGA+ does
not use the round-robin partition-to-server map (shown
in Figure 1) and instead maps all future partitions to the
new servers in a “sequential manner”. The benefit of
round-robin mapping is faster exploitation of parallelism
when a directory is small and growing, while a sequential
mapping for the tail set of partitions does not disturb
previously mapped partitions more than is mandatory for
load balancing.

[Figure 2 diagram: original server configuration (round-robin mapping) and new servers (sequential mapping)]

Figure 2 — Server additions in GIGA+. To minimize the
amount of data migrated, indicated by the arrows that show
splits, GIGA+ changes the partition-to-server mapping from
round-robin on the original server set to sequential on the newly
added servers.

⁴ Server removal (i.e., decommissioned, not failed and later replaced)
is not as important for high performance systems so we leave it to be
done by user-level data copy tools.


Figure 2 shows an example where the original configuration
has 5 servers with 3 partitions each, and partitions
P_0 to P_14 use a round-robin rule (for P_i, server is i mod
N, where N is number of servers). After the addition of
two servers, the six new partitions P_15 to P_20 will be mapped
to servers using the new mapping rule: i div M, where
M is the number of partitions per server (e.g., 3
partitions/server).
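The two mapping rules in this example can be written down as a short sketch. The function name `server_for_partition` is hypothetical, and the switch-over condition (identifiers at or beyond N times M use the sequential rule) is an assumption chosen so that P_15 is the first sequentially mapped partition.

```python
def server_for_partition(i, n_initial, per_server):
    """Map partition identifier i to a server index.

    n_initial  -- N, number of servers in the original configuration
    per_server -- M, fixed number of partitions per server
    """
    if i < n_initial * per_server:
        return i % n_initial       # round-robin over the original servers
    return i // per_server         # sequential over later-added servers


# Example: 5 original servers, 3 partitions per server, 2 servers added.
mapping = {i: server_for_partition(i, 5, 3) for i in range(21)}
for server in range(7):
    owned = [i for i, s in mapping.items() if s == server]
    print(server, owned)
```

In this sketch partitions P_15 to P_17 land on server 5 and P_18 to P_20 on server 6, while P_0 to P_14 keep their original round-robin placement.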


In GIGA+ even the number of servers can be stale at 
servers and clients. The arrival of a new server and its 
order in the global server list is declared by the cluster 
file system’s configuration management protocol, such as 
Zookeeper for HDFS [29], leading to each existing server 
eventually noticing the new server. Once it knows about 
new servers, an existing server can inspect its partitions 
for those that have sufficient directory entries to warrant 
splitting and would split to a newly added server. The 
normal GIGA+ splitting mechanism kicks in to migrate 
only directory entries that belong on the new servers. The 
order in which an existing server inspects partitions can 
be entirely driven by client references to partitions, biasing
migration in favor of active directories. Alternatively,
under administrator control, it can be driven by a
background traversal of a list of partitions whose size exceeds
the splitting threshold.
4 GIGA+ Implementation


The GIGA+ indexing mechanism is primarily concerned with
distributing the contents and work of large file system
directories over many servers, and client interactions with
these servers. It is not about the representation of directory
entries on disk, and follows the convention of reusing
mature local file systems like ext3 or ReiserFS (in Linux)
for disk management, as is done by many modern
cluster file systems [39, 46, 54, 63, 66].


The most natural implementation strategy for GIGA+ 
is as an extension of the directory functions of a cluster 
file system. GIGA+ is not about striping the data of huge 
files, server failure detection and failover mechanism, or 
RAID/replication of data for disk fault tolerance. These
functions are present and, for GIGA+ purposes, adequate
in most cluster file systems. The authors of a new version of
PVFS, called OrangeFS, are doing just this by integrating
GIGA+ into OrangeFS [2, 45]. Our goal is not to compete
with most features of these systems, but to offer technol- 
ogy for advancing their support of high rates of mutation 
of large collections of small files. 


For the purposes of evaluating GIGA+ on file system 
directory workloads, we have built a skeleton cluster file 
system; that is, we have not implemented data striping, 
fault detection or RAID in our experimental framework. 
Figure 3 shows our user-level GIGA+ directory prototypes 
built using the FUSE API [19]. Both client and server pro- 
cesses run in user-space, and communicate over TCP using 
SUN RPC [56]. The prototype has three layers: unmodi- 
fied applications running on clients, the GIGA+ indexing 
modules (of the skeletal cluster file system on clients and 
servers) and a backend persistent store at the server. Ap- 
plications interact with a GIGA+ client using the VFS 
API (e.g., open(), creat() and close() syscalls).
The FUSE kernel module intercepts and redirects these
VFS calls to the client-side GIGA+ indexing module, which
implements the logic described in the previous section.


4.1 Server implementation 


The GIGA+ server module’s primary purpose is to syn- 
chronize and serialize interactions between all clients and 
a specific partition. It need not “store” the partitions, but 
it owns them by performing all accesses to them. Our 
server-side prototype is currently layered on lower level 
file systems, ext3 and ReiserFS. This decouples GIGA+ 
indexing mechanisms from on-disk representation. 


Servers map logical GIGA+ partitions to directory ob- 
jects within the backend file system. For a given (huge) 
directory, its entry in its parent directory names the “zeroth
partition”, P_0^{(0,1]}, which is a directory in a server’s
underlying file system. Most directories are not huge and
will be represented by just this one zeroth partition.


GIGA+ stores some information as extended attributes
on the directory holding a partition: a GIGA+ directory ID
(unique across servers), the partition identifier P_i and
its range (x,y]. The range implies the leaf in the directory’s
logical tree view of the huge directory associated with
this partition (the center column of Figure 1) and that
determines the prior splits that had to have occurred to
cause this partition to exist (that is, the split history).

Figure 3 — GIGA+ experimental prototype.


To associate an entry in a cached index (a partition) with 
a specific server, we need the list of servers over which 
partitions are round robin allocated and the list of servers 
over which partitions are sequentially allocated. The set 
of servers that are known to the cluster file system at the 
time of splitting the zeroth partition is the set of servers 
that are round robin allocated for this directory and the set 
of servers that are added after a zeroth partition is split are 
the set of servers that are sequentially allocated.⁵


Because the current list of servers will always be available
in a cluster file system, only the list of servers at the
time of splitting the zeroth partition needs to be also stored
in a partition’s extended attributes. Each split propagates
the directory ID and set of servers at the time of the zeroth
partition split to the new partition, and sets the new
partition’s identifier P_i and range (x,y] as well as providing the
entries from the parent partition that hash into this range
(x,y].

Each partition split is handled by the GIGA+ server by 
locally locking the particular directory partition, scanning 
its entries to build two sub-partitions, and then transac- 
tionally migrating ownership of one partition to another 
server before releasing the local lock on the partition [55]. 
In our prototype layered on local file systems, there is no 
transactional migration service available, so we move the 
directory entries and copy file data between servers. Our 
experimental splits are therefore more expensive than they 
should be in a production cluster file system. 


4.2 Client implementation 


The GIGA+ client maintains cached information, some
potentially stale, global to all directories. It caches the current
server list (which we assume only grows over time)
and the number of partitions per server (which is fixed)
obtained from whichever server GIGA+ was mounted on.
For each active directory GIGA+ clients cache the cluster-wide
i-node of the zeroth partition, the directory ID, and
the number of servers at the time when the zeroth partition
first split. The latter two are available as extended
attributes of the zeroth partition. Most importantly, the
client maintains a bitmap of the global index built according
to Section 3, and a maximum tree-depth, r = ⌈log(i)⌉,
of any partition P_i present in the global index.

⁵ The contents of a server list are logical server IDs (or names) that are
converted to IP addresses dynamically by a directory service integrated
with the cluster file system. Server failover (and replacement) will bind a
different address to the same server ID so the list does not change during
normal failure handling.


Searching for a file name with a specific hash value,
H, is done by inspecting the index bitmap at the offset j
determined by the r lower-order bits of H. If this is set
to ‘1’ then H is in partition P_j. If not, decrease r by one
and repeat until r = 0 which refers to the always known
zeroth partition P_0. Identifying the server for partition P_j
is done by lookup in the current server list. It is either
j mod N (where N is the number of servers at the time the
zeroth partition split), or j div M (where M is the number
of partitions per server), with the latter used if j exceeds
the product of the number of servers at the time of zeroth
partition split and the number of partitions per server.
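The bitmap search can be sketched as follows. The bitmap is modeled as a Python integer whose bit j is set when partition P_j is present, and `find_partition` is a hypothetical name. One deliberate assumption differs from the formula in the text: the starting depth is the bit length of the largest identifier in the bitmap, so that identifiers that are exact powers of two are also reachable.

```python
def find_partition(h, bitmap):
    """Return the identifier of the partition that should hold hash h.

    h      -- integer hash value of the filename
    bitmap -- integer whose bit j is 1 if partition P_j is in the index
    """
    highest = bitmap.bit_length() - 1   # largest identifier present
    r = highest.bit_length()            # maximum tree depth to try
    while r > 0:
        j = h & ((1 << r) - 1)          # offset from the r low-order bits
        if (bitmap >> j) & 1:
            return j
        r -= 1                          # fall back to a shallower depth
    return 0                            # the always-known zeroth partition


# Index containing P0, P1, P2, P3 and P5 (bits 0, 1, 2, 3 and 5 set).
bitmap = 0b101111
for h in (0b101, 0b111, 0b100, 0b110, 0b001):
    print(bin(h), "->", find_partition(h, bitmap))
```

For this index, a hash ending in 101 resolves to P_5, one ending in 111 falls back to P_3, and one ending in 100 falls back to P_0, because P_7 and P_4 do not exist.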


Most VFS operations depend on lookups; readdir()
however can be done by walking the bitmaps, enumerating
the partitions and scanning the directories in the
underlying file system used to store partitions.


4.3 Handling failures 


Modern cluster file systems scale to sizes that make fault 
tolerance mandatory and sophisticated [8, 20, 65]. With 
GIGA+ integrated in a cluster file system, fault tolerance 
for data and services is already present, and GIGA+ does 
not add major challenges. In fact, network partitions
and client-side reboots are relatively easy to handle
because GIGA+ tolerates stale entries in a client’s cached 
index of the directory partition-to-server mapping and be- 
cause GIGA+ does not cache directory entries in client 
or server processes (changes are written through to the 
underlying file system). Directory-specific client state can 
be reconstructed by contacting the zeroth partition named 
in a parent directory entry, re-fetching the current server 
list and rebuilding bitmaps through incorrect addressing 
of server partitions during normal operations. 


Other issues, such as on-disk representation and disk 
failure tolerance, are a property of the existing cluster file 
system’s directory service, which is likely to be based on 
replication even when large data files are RAID encoded 
[66]. Moreover, if partition splits are done under a lock 
over the entire partition, which is how our experiments are 
done, the implementation can use a non-overwrite strategy 
with a simple atomic update of which copy is live. As a 
result, recovery becomes garbage collection of spurious 
copies triggered by the failover service when it launches 
a new server process or promotes a passive backup to be 
the active server [9, 29, 65]. 


While our architecture presumes GIGA+ is integrated 
into a full featured cluster file system, it is possible to layer 
GIGA+ as an interposition layer over and independent of a 
cluster file system, which itself is usually layered over mul- 
tiple independent local file systems [20, 46, 54, 66]. Such 
a layered GIGA+ would not be able to reuse the fault toler- 
ance services of the underlying cluster file system, leading 
to an extra layer of fault tolerance. The primary function 
of this additional layer of fault tolerance is replication 
of the GIGA+ server’s write-ahead logging for changes 
it is making in the underlying cluster file system, detec- 
tion of server failure, election and promotion of backup 
server processes to be primaries, and reprocessing of the 
replicated write-ahead log. Even the replication of the 
write-ahead log may be unnecessary if the log is stored in 
the underlying cluster file system, although such logs are 
often stored outside of cluster file systems to improve the 
atomicity properties of writing to them [12, 26]. To ensure
load balancing during server failure recovery, the layered 
GIGA+ server processes could employ the well-known 
chained-declustering replication mechanism to shift work 
among server processes [28], which has been used in other 
distributed storage systems [33, 60]. 


5 Experimental Evaluation 


Our experimental evaluation answers two questions: (1) 
How does GIGA+ scale? and (2) What are the tradeoffs 
of GIGA+’s design choices involving incremental growth, 
weak index consistency and selection of the underlying 
local file system for out-of-core indexing (when partitions 
are very large)? 

All experiments were performed on a cluster of 64 ma- 
chines, each with dual quad-core 2.83GHz Intel Xeon 
processors, 16GB memory and a 10GigE NIC, and Arista 
10 GigE switches. All nodes were running the Linux 
2.6.32-js6 kernel (Ubuntu release) and GIGA+ stores par- 
titions as directories in a local file system on one 7200rpm 
SATA disk (a different disk is used for all non-GIGA+ 
storage). We assigned 32 nodes as servers and the remain- 
ing 32 nodes as load generating clients. The threshold for 
splitting a partition is always 8,000 entries. 

We used the synthetic mdtest benchmark [41] (used
by parallel file system vendors and users) to insert zero-byte
files into a directory [27, 63]. We generated three
types of workloads. First, a concurrent create workload
that creates a large number of files concurrently in a single
directory. Our configuration uses eight processes per client
to simultaneously create files in a common directory, and
the number of files created is proportional to the number of
servers: a single server manages 400,000 files, an 800,000
file directory is created on 2 servers, a 1.6 million file
directory on 4 servers, up to a 12.8 million file directory
on 32 servers. Second, we use a lookup workload that
performs a stat() on random files in the directory. And
finally, we use a mixed workload where clients issue create
and lookup requests in a pre-configured ratio.

File System                               File creates/second
                                          in one directory
GIGA+                  Library API        17,902
(layered on Reiser)    VFS/FUSE API        5,977
Local                  Linux ext3         16,470
file systems           Linux ReiserFS     20,705
Networked              NFSv3 filer           521
file systems           HadoopFS            4,290
                       PVFS                1,064

Table 1 — File create rate in a single directory on a single
server. An average of five runs (with 1% standard deviation).


5.1 Scale and performance 


We begin with a baseline for the performance of various 
file systems running the mdtest benchmark. First we 
compare mdtest running locally on Linux ext3 and Reis- 
erFS local file systems to mdtest running on a separate 
client and single server instance of the PVFS cluster file 
system (using ext3) [46], Hadoop’s HDFS (using ext3) 
[54] and a mature commercial NFSv3 filer. In this experi- 
ment GIGA+ always uses one partition per server. Table 1 
shows the baseline performance. 


For GIGA+ we use two machines with ReiserFS on
the server and two ways to bind mdtest to GIGA+: direct
library linking (non-POSIX) and VFS/FUSE linkage
(POSIX). The library approach allows mdtest to use
custom object creation calls (such as giga_creat())
avoiding system call and FUSE overhead in order to compare
to mdtest directly in the local file system. Among
the local file systems, with local mdtest threads generating
file creates, both ReiserFS and Linux ext3 deliver high
directory insert rates.⁶ Both file systems were configured
with the -noatime and -nodiratime options; Linux ext3
used write-back journaling and the dir_index option
to enable hashed-tree indexing, and ReiserFS was configured
with the -notail option, a small-file optimization
that packs the data inside an i-node for high performance
[47]. GIGA+ with mdtest workload generating threads
on a different machine, when using the library interface
(sending only one RPC per create) and ReiserFS as the
backend file system, creates at better than 80% of the
rate of ReiserFS with local load generating threads. This
comparison shows that remote RPC is not a huge penalty
for GIGA+. We tested this library version only to gauge
GIGA+ efficiency compared to local file systems and do
not use this setup for any remaining experiments.

To compare with the network file systems, GIGA+
uses the VFS/POSIX interface. In this case each VFS
file creat() results in three RPC calls to the server:
getattr() to check if a file exists, the actual creat()
and another getattr() after creation to load the created
file’s attributes. For a more enlightening comparison,
cluster file systems were configured to be functionally
equivalent to the GIGA+ prototype; specifically, we disabled
HDFS’s write-ahead log and replication, and we
used default PVFS which has no redundancy unless a
RAID controller is added. For the NFSv3 filer, because
it was in production use, we could not disable its RAID
redundancy and it is correspondingly slower than it might
otherwise be. GIGA+ directories using the VFS/FUSE
interface also outperform all three networked file systems,
probably because the GIGA+ experimental prototype is
skeletal while others are complex production systems.

⁶ We tried XFS too, but it was extremely slow during the create-intensive
workload and we do not report those numbers in this paper.

Figure 4 — Scalability of GIGA+ FS directories. GIGA+ directories
deliver a peak throughput of roughly 98,000 file creates per
second. The behavior of underlying local file system (ReiserFS)
limits GIGA+’s ability to match the ideal linear scalability.


Figure 4 plots aggregate operation throughput, in file 
creates per second, averaged over the complete concurrent 
create benchmark run as a function of the number of 
servers (on a log-scale X-axis). GIGA+ with partitions 
stored as directories in ReiserFS scales linearly up to the 
size of our 32-server configuration, and can sustain 98,000 
file creates per second - this exceeds today’s most rigorous 
scalability demands [44]. 


Figure 4 also compares GIGA+ with the scalability of 
the Ceph file system and the HBase distributed key-value 
store. For Ceph, Figure 4 reuses numbers from experi- 
ments performed on a different cluster from the original 
paper [63]. That cluster used dual-core 2.4GHz machines 
with IDE drives, with equal numbered separate nodes as 
workload generating clients, metadata servers and disk 
servers with object stores layered on Linux ext3. HBase 
is used to emulate Google’s Colossus file system which
plans to store file system metadata in BigTable instead
of internally on a single master node [18]. We set up HBase
on a 32-node HDFS configuration with a single copy (no
replication) and disabled two parameters: blocking while
the HBase servers are doing compactions and write-ahead
logging for inserts (a common practice to speed up inserting
data in HBase). This configuration allowed HBase
to deliver better performance than GIGA+ for the single
server configuration because the HBase tables are striped
over all 32 nodes in the HDFS cluster. But configurations
with many HBase servers scale poorly.

Figure 5 — Incremental scale-out growth. GIGA+ achieves linear
scalability after distributing one partition on each available
server. During scale-out, periodic drops in aggregate create rate
correspond to concurrent splitting on all servers.


GIGA+ also demonstrated scalable performance for the 
concurrent lookup workload delivering a throughput of 
more than 600,000 file lookups per second for our 32- 
server configuration (not shown). Good lookup perfor- 
mance is expected because the index is not mutating and 
load is well-distributed among all servers; the first few 
lookups fetch the directory partitions from disk into the 
buffer cache and the disk is not used after that. Section 
5.4 gives insight on addressing errors during mutations. 


5.2 Incremental scaling properties 


In this section, we analyze the scaling behavior of the 
GIGA+ index independent of the disk and the on-disk di- 
rectory layout (explored later in Section 5.5). To eliminate 
performance issues in the disk subsystem, we use Linux’s 
in-memory file system, tmpfs, to store directory parti- 
tions. Note that we use tmpfs only in this section, all 
other analysis uses on-disk file systems. 


We run the concurrent create benchmark to create a 
large number of files in an empty directory and measure 
the aggregate throughput (file creates per second) continu- 
ously throughout the benchmark. We ask two questions 
about scale-out heuristics: (1) what is the effect of split- 
ting during incremental scale-out growth? and (2) how 
many partitions per server do we keep? 


Figure 5 shows the first 8 seconds of the concurrent 
create workload when the number of partitions per server 
is one. The primary result in this figure is the near linear 
create rate seen after the initial seconds. But the initial 
few seconds are more complex. In the single server case,
as expected, the throughput remains flat at roughly 7,500
file creates per second due to the absence of any other
server. In the 2-server case, the directory starts on a single
server and splits when it has more than 8,000 entries in
the partition. When the servers are busy splitting, at the
0.8-second mark, throughput drops to half for a short time.

Figure 6 — Effect of splitting heuristics. GIGA+ shows that
splitting to create at most one partition on each of the 16 servers
delivers scalable performance. Continuous splitting, as in classic
database indices, is detrimental in a distributed scenario.


Throughput degrades even more during the scale-out 
phase as the number of directory servers goes up. For 
instance, in the 8-server case, the aggregate throughput 
drops from roughly 25,000 file creates/second at the 3-second
mark to as low as a couple of hundred creates/second
before growing to the desired 50,000 creates/second. This 
happens because all servers are busy splitting, i.e., parti- 
tions overflow at about the same time which causes all 
servers (where these partitions reside) to split without any 
co-ordination at the same time. And after the split spreads 
the directory partitions on twice the number of servers, the 
aggregate throughput achieves the desired linear scale. 


In the context of the second question about how many 
partitions per server, classic hash indices, such as ex- 
tendible and linear hashing [17, 34], were developed for 
out-of-core indexing in single-node databases. An out-of- 
core table keeps splitting partitions whenever they over- 
flow because the partitions correspond to disk allocation 
blocks [23]. This implies an unbounded number of par- 
titions per server as the table grows. However, the splits 
in GIGA+ are designed to parallelize access to a directory 
by distributing the directory load over all servers. Thus 
GIGA+ can stop splitting after each server has a share 
of work, i.e., at least one partition. When GIGA+ limits 
the number of partitions per server, the size of partitions 
continues to grow and GIGA+ lets the local file system
on each server handle physical allocation and out-of-core 
memory management. 


Figure 6 compares the effect of different policies for the
number of partitions per server on the system throughput
(using a log-scale Y-axis) during a test in which a large
directory is created over 16 servers. Graph (a) shows a split
policy that stops when every server has one partition, causing
partitions to ultimately get much bigger than 8,000
entries. Graph (b) shows the continuous splitting policy
used by classic database indices where a split happens
whenever a partition has more than 8,000 directory entries.
With continuous splitting the system experiences periodic
throughput drops that last longer as the number of partitions
increases. This happens because repeated splitting
maps multiple partitions to each server, and since uniform
hashing will tend to overflow all partitions at about the
same time, multiple partitions will split on all the servers
at about the same time.

Figure 7 — Load-balancing in GIGA+. These graphs show the
quality of load balancing measured as the mean load deviation
across the entire cluster (with 95% confidence interval bars). Like
virtual servers in consistent hashing, GIGA+ also benefits from
using multiple hash partitions per server. GIGA+ needs one to
two orders of magnitude fewer partitions per server to achieve
comparable load distribution relative to consistent hashing.


Lesson #1: To avoid the overhead of continuous split- 
ting in a distributed scenario, GIGA+ stops splitting a 
directory after all servers have a fixed number of partitions 
and lets a server’s local file system deal with out-of-core 
management of large partitions. 


5.3 Load balancing efficiency


The previous section showed only configurations where 
the number of servers is a power-of-two. This is a spe- 
cial case because it is naturally load-balanced with only a 
single partition per server: the partition on each server is 
responsible for a hash-range of size $1/2^r$ of the total 
hash-range (0, 1], where $2^r$ is the number of servers. When 
the number of servers is not a 
power-of-two, however, there is load imbalance. Figure 7 
shows the load imbalance measured as the average fractional 
deviation from even load for all numbers of servers 
from 1 to 32 using a Monte Carlo model of load distribution. 
In a cluster of 10 servers, for example, each server is 
expected to handle 10% of the total load; however, if two 
servers are experiencing 16% and 6% of the load, then 
they have 60% and 40% variance from the average load 
respectively. For different cluster sizes, we measure the 
variance of each server, and use the average (and 95% 
confidence interval error bars) over all the servers. 
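
The 10-server example above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative (they do not come from the GIGA+ prototype), and the eight remaining servers are assumed to share the rest of the load equally.

```python
def fractional_deviations(loads):
    """Return |load - mean| / mean for each server's load."""
    mean = sum(loads) / len(loads)
    return [abs(load - mean) / mean for load in loads]


def mean_deviation(loads):
    """Average fractional deviation over all servers."""
    devs = fractional_deviations(loads)
    return sum(devs) / len(devs)


# Ten servers: two are skewed (16% and 6% of the load); the other
# eight are assumed to split the remaining 78% evenly.
loads = [16.0, 6.0] + [78.0 / 8] * 8
devs = fractional_deviations(loads)
print(round(devs[0], 2), round(devs[1], 2))  # prints: 0.6 0.4
```

Averaging these per-server values, as `mean_deviation` does, gives the single number plotted for each cluster size.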
Figure 8: Cost of splitting partitions. Using 4, 8, or 16 partitions 
per server improves the performance of GIGA+ directories 
layered on Linux ext3 relative to 1 partition per server (better 
load-balancing) or 32 partitions per server (when the cost of 
more splitting dominates the benefit of load-balancing). 


We compute load imbalance for GIGA+ in Figure 7(a) 
as follows: when the number of servers N is not a power-of-two, 
$2^r < N < 2^{r+1}$, then a random set of $N - 2^r$ partitions 
from tree depth r, each with range size $1/2^r$, will 
have split into $2(N - 2^r)$ partitions with range size $1/2^{r+1}$. 
Figure 7(a) shows the results of five random selections 
of $N - 2^r$ partitions that split on to the $r+1$ level. 
Figure 7(a) shows the expected periodic pattern where the 
system is perfectly load-balanced when the number of 
servers is a power-of-two. With more than one partition 
per server, each partition will manage a smaller portion 
of the hash-range and the sum of these smaller partitions 
will be less variable than a single large partition as shown 
in Figure 7(a). Therefore, more splitting to create more 
than one partition per server significantly improves load 
balance when the number of servers is not a power-of-two. 
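
The imbalance computation described above can be sketched as a small Monte Carlo model. This is a hedged illustration rather than the authors' simulator: the function names are hypothetical, and partitions are assumed to be dealt to servers at random with an equal count per server, which the text does not specify.

```python
import random


def giga_partition_sizes(total_partitions, rng):
    """Hash-range sizes after a binary split tree reaches `total_partitions` leaves."""
    r = total_partitions.bit_length() - 1        # largest r with 2**r <= total
    n_split = total_partitions - 2 ** r          # depth-r partitions that split again
    split = set(rng.sample(range(2 ** r), n_split))
    sizes = []
    for p in range(2 ** r):
        if p in split:
            sizes.extend([1 / 2 ** (r + 1)] * 2)  # two children at depth r+1
        else:
            sizes.append(1 / 2 ** r)              # unsplit partition at depth r
    return sizes


def mean_load_deviation(num_servers, partitions_per_server, rng):
    """Mean fractional deviation from even load across all servers."""
    sizes = giga_partition_sizes(num_servers * partitions_per_server, rng)
    rng.shuffle(sizes)                            # assumption: random placement
    loads = [sum(sizes[s::num_servers]) for s in range(num_servers)]
    even = 1 / num_servers
    return sum(abs(load - even) / even for load in loads) / num_servers


rng = random.Random(0)
print(mean_load_deviation(16, 1, rng))   # 0.0 for a power-of-two cluster
print(mean_load_deviation(10, 1, rng))   # one partition per server, N = 10
print(mean_load_deviation(10, 8, rng))   # eight partitions per server, N = 10
```

With one partition per server the result depends only on N, because the multiset of range sizes is fixed; the random selection matters once each server holds several partitions.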


Multiple partitions per server are also used by Amazon’s 
Dynamo key-value store to alleviate the load imbalance 
in consistent hashing [15]. Consistent hashing associates 
each partition with a random point in the hash-space (0, 1] 
and assigns it the range from this point up to the next 
larger point and wrapping around, if necessary. Figure 7(b) 
shows the load imbalance from Monte Carlo simulation 
of using multiple partitions (virtual servers) in consistent 
hashing by using five samples of a random assignment 
for each partition and how the sum, for each server, of 
partition ranges selected this way varies across servers. 
Because consistent hashing’s partitions have more ran- 
domness in each partition’s hash-range, it has a higher 
load variance than GIGA+ — almost two times worse. In- 
creasing the number of hash-range partitions significantly 
improves load distribution, but consistent hashing needs 
more than 128 partitions per server to match the load vari- 
ance that GIGA+ achieves with 8 partitions per server — an 
order of magnitude more partitions. 
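
The consistent-hashing side of the comparison can be modelled in the same spirit. The sketch below places random points (virtual servers) on the hash space and gives each point the range up to the next larger point, wrapping around, as described above. Function names and the trial count are assumptions made for illustration.

```python
import random


def consistent_hash_loads(num_servers, points_per_server, rng):
    """Fraction of the hash space (0, 1] owned by each server."""
    points = sorted((rng.random(), server)
                    for server in range(num_servers)
                    for _ in range(points_per_server))
    if len(points) == 1:
        return [1.0]                      # a single point owns the whole ring
    loads = [0.0] * num_servers
    for idx, (pos, server) in enumerate(points):
        nxt = points[(idx + 1) % len(points)][0]
        loads[server] += (nxt - pos) % 1.0   # modulo handles the wrap-around
    return loads


def mean_deviation(loads):
    """Mean fractional deviation from even load."""
    even = 1 / len(loads)
    return sum(abs(load - even) / even for load in loads) / len(loads)


rng = random.Random(0)
for points in (1, 8, 128):
    trials = [mean_deviation(consistent_hash_loads(16, points, rng))
              for _ in range(5)]
    print(points, sum(trials) / len(trials))
```

Increasing the number of points per server shrinks the deviation, but slowly, which is the effect the comparison with GIGA+ relies on.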

Having more partitions is particularly bad because it takes 
longer for the system to stop splitting, and Figure 8 shows 
how this can impact overall performance. Consistent hashing 
theory has alternate strategies for reducing imbalance 
but these often rely on extra accesses to servers all of the 
time and global system state, both of which will cause 
impractical degradation in our system [10]. 


Figure 9: Cost of using inconsistent mapping at the clients. 
Using weak consistency for mapping state at the clients has 
negligible overhead on client performance (a). This 
overhead (extra message re-addressing hops) occurs for initial 
requests until the client learns about all the servers (b and c). 


Since having more partitions per server always improves 
load-balancing, at least a little, how many partitions should 
GIGA+ use? Figure 8 shows the concurrent create bench- 
mark time for GIGA+ as a function of the number of 
servers for 1, 4, 8, 16 and 32 partitions per server. We ob- 
serve that with 32 partitions per server GIGA+ is roughly 
50% slower than with 4, 8 and 16 partitions per server. 
Recall from Figure 7(a) that the load-balancing efficiency 
from using 32 partitions per server is only about 1% bet- 
ter than using 16 partitions per server; the high cost of 
splitting to create twice as many partitions outweighs the 
minor load-balancing improvement. 


Lesson #2: Splitting to create more than one partition 
per server significantly improves GIGA+ load balancing 
for non power-of-two numbers of servers, but because of 
the performance penalty during extra splitting the overall 
performance is best with only a few partitions per server. 


5.4 Cost of weak mapping consistency 


Figure 9(a) shows the overhead incurred by clients when 
their cached indices become stale. We measure the percentage 
of all client requests that were re-routed when running 
the concurrent create benchmark on different cluster 
sizes. This figure shows that, in absolute terms, fewer than 
0.05% of the requests are addressed incorrectly; this is 
only about 200 requests per client because each client is 
doing 400,000 file creates. The number of addressing er- 
rors increases proportionally with the number of partitions 
per server because it takes longer to create all partitions. In 
the case when the number of servers is a power-of-two, af- 
ter each server has at least one partition, subsequent splits 
yield two smaller partitions on the same server, which will 
not lead to any additional addressing errors. 


We study further the worst case in Figure 9(a), 30 servers 
with 16 partitions per server, to learn when addressing er- 
rors occur. Figure 9(b) shows the number of errors encoun- 
tered by each request generated by one client thread (i.e., 
one of the eight workload generating threads per client) as 
it creates 50,000 files in this benchmark. Figure 9(b) sug- 
gests three observations. First, the index update that this 
thread gets from an incorrectly addressed server is always 
sufficient to find the correct server on the second probe. 
Second, that addressing errors are bursty, one burst for 
each level of the index tree needed to create 16 partitions 
on each of 30 servers, or 480 partitions ($2^8 < 480 < 2^9$). 
And finally, that the last 80% of the work is done after the 
last burst of splitting without any addressing errors. 


To further emphasize how little incorrect server address- 
ing clients generate, Figure 9(c) shows the addressing 
experience of a new client issuing 10,000 lookups after 
the concurrent create benchmark has completed on 30 servers 
with 16 partitions per server.⁷ This client makes no more 
than 3 addressing errors for a specific request and no 
more than 30 addressing errors in total, and it makes no 
addressing errors after the 40th request. 


Lesson #3: GIGA+ clients incur negligible overhead (in 
terms of incorrect addressing errors) due to stale cached 
indices, and no overhead shortly after the servers stop 
splitting partitions. Although not a large effect, having fewer 
partitions per server lowers client addressing errors. 


5.5 Interaction with backend file systems 


Because some cluster file systems represent directories 
with equivalent directories in a local file system [39] and 
because our GIGA+ experimental prototype represents 
partitions as directories in a local file system, we study 
how the design and implementation of Linux ext3 and 
ReiserFS local file systems affect GIGA+ partition splits. 
Although different local file system implementations can 
be expected to have different performance, especially for 
emerging workloads like ours, we were surprised by the 
size of the differences. 


⁷ Figure 9 predicts the addressing errors of a client doing only 
lookups on a mutating directory because both create(filename) 
and lookup(filename) do the same addressing. 


9th USENIX Conference on File and Storage Technologies 


[Figure 10 plot: four panels showing the number of files created per second (on 16 servers) against running time (seconds), for 1 and 16 partitions per server on ReiserFS and on ext3.] 

Figure 10: Effect of underlying file systems. This graph shows the concurrent create benchmark behavior when the GIGA+ 
directory service is distributed on 16 servers with two local file systems, Linux ext3 and ReiserFS. For each file system, we show two 
different numbers of partitions per server, 1 and 16. 


Figure 10 shows GIGA+ file create rates when there are 
16 servers for four different configurations: Linux ext3 
or ReiserFS storing partitions as directories, and 1 or 16 
partitions per server. Linux ext3 directories use h-trees 
[11] and ReiserFS uses balanced B-trees [47]. We observed 
two interesting phenomena: first, the benchmark 
running time varies from about 100 seconds to over 600 
seconds, a factor of 6, and second, the backend file system 
yielding the faster performance is different when there are 
16 partitions on each server than with only one. 


Comparing a single partition per server in GIGA+ over 
ReiserFS and over ext3 (left column in Figure 10), we ob- 
serve that the benchmark completion time increases from 
about 100 seconds using ReiserFS to nearly 170 seconds 
using ext3. For comparison, the same benchmark com- 
pleted in 70 seconds when the backend was the in-memory 
tmpfs file system. Looking more closely at Linux ext3, 
as a directory grows, ext3’s journal also grows and periodically 
triggers ext3’s kjournald daemon to flush a part 
of the journal to disk. Because directories are growing 
on all servers at roughly the same rate, multiple servers 
flush their journal to disk at about the same time leading 
to troughs in the aggregate file create rate. We observe 
this behavior for all three journaling modes supported by 
ext3. We confirmed this hypothesis by creating an ext3 
configuration with the journal mounted on a second disk 
in each server, and this eliminated most of the throughput 
variability observed in ext3, completing the benchmark 
almost as fast as with ReiserFS. For ReiserFS, however, 
placing the journal on a different disk had little impact. 


The second phenomenon we observe, in the right col- 
umn of Figure 10, is that for GIGA+ with 16 partitions 
per server, ext3 (which is insensitive to the number of par- 
titions per server) completes the create benchmark more 
than four times faster than ReiserFS. We suspect that this 
results from the on-disk directory representation. Reis- 
erFS uses a balanced B-tree for all objects in the file 
system, which re-balances as the file system grows and 
changes over time [47]. When partitions are split more 
often, as in the case of 16 partitions per server, the backend 
file system structure changes more, which triggers more 
re-balancing in ReiserFS and slows the create rate. 


Lesson #4: Design decisions of the backend file system 
have subtle but large side-effects on the performance of a 
distributed directory service. Perhaps the representation 
of a partition should not be left to the vagaries of whatever 
local file system is available. 


6 Conclusion 


In this paper we address the emerging requirement for 
POSIX file system directories that store massive numbers 
of files and sustain hundreds of thousands of concurrent 
mutations per second. The central principle of GIGA+ 
is to use asynchrony and eventual consistency in the dis- 
tributed directory’s internal metadata to push the limits of 
scalability and concurrency of file system directories. We 
used these principles to prototype a distributed directory 
implementation that scales linearly to best-in-class per- 
formance on a 32-node configuration. Our analysis also 
shows that GIGA+ achieves better load balancing than 
consistent hashing and incurs negligible overhead from 
allowing stale lookup state at its clients. 


Abstract 


Dispersing files across multiple sites yields a variety of 
obvious benefits, such as availability, proximity and reli- 
ability. Less obviously, it enables security to be achieved 
without relying on encryption keys. Standard approaches 
to dispersal either achieve very high security with corre- 
spondingly high computational and storage costs, or low 
security with lower costs. In this paper, we describe a 
new dispersal scheme, called AONT-RS, which blends an 
All-Or-Nothing Transform with Reed-Solomon coding 
to achieve high security with low computational and stor- 
age costs. We evaluate this scheme both theoretically and 
as implemented with standard open source tools. AONT- 
RS forms the backbone of a commercial dispersed stor- 
age system, which we briefly describe and then use as a 
further experimental testbed. We conclude with details 
of actual deployments. 


1 Introduction 


Dispersed storage systems coalesce multiple storage sites 
into a collective whole. Files are decomposed into 
smaller blocks which are computationally massaged and 
then dispersed to the storage sites. When a client desires 
to read a file, it retrieves some subset of the blocks, which 
are combined to reconstitute the original file. Compared 
to traditional single-site storage systems, dispersed stor- 
age systems offer a variety of benefits. Multiple indepen- 
dent storage sites offer greater availability than a single 
site, since they have no single point of failure. When 
sites are physically distributed across a wide area, they 
offer physical proximity to distributed clients, which can 
improve performance and scalability. Finally, the mas- 
saging of data typically includes adding redundancy in 
the form of erasure codes or secret sharing, which im- 
proves reliability in the face of failures. 

There have been many dispersed storage systems 
developed in the past ten years. Examples include 
storage systems such as Oceanstore [23], Pergamum [29], 
POTSHARDS [30], PASIS [9], Gridsharing [31], 
Glacier [11], Cleversafe [4] and Tahoe-LAFS [32] 
among others. Related to dispersed storage 
systems are distributed or peer-to-peer storage systems 
which use replication rather than coding to achieve relia- 
bility. Examples include LOCKSS [14], Google file sys- 
tem [8], Elephant [27], PAST [26] and BitTorrent [5]. 

A side benefit of dispersal is the ability to provide 
security without the use of encryption keys. The basic 
techniques are classics from computer science literature: 
Shamir’s secret sharing [28] and Rabin’s information dis- 
persal based on non-systematic erasure codes [21]. Each 
technique is a (k,n) threshold scheme: The storage sys- 
tem transforms a file into n distinct blocks. A client or 
attacker must retrieve at least k of the n blocks to re- 
construct the file. With fewer than k blocks, the client 
or attacker gets no information. Several of the above- 
mentioned systems [9, 30, 31] use these techniques to 
achieve security by storing each of the n pieces at a dif- 
ferent site, and assuming that an attacker will not be able 
to authenticate himself to at least k of them. This avoids 
encryption strategies which require the secure storage 
of encryption keys, a difficult and dangerous practice 
(see [30] for a thorough discussion of this problem). 

Each technique achieves a different level of security 
with different performance and storage requirements. If 
the original file is b bytes in size, Shamir’s scheme requires 
a total of $nb$ bytes, while Rabin’s requires $nb/k$. 
Shamir’s requires more computation as well. To com- 
pensate for the extra storage and computation, Shamir’s 
scheme is more secure, achieving information theoretic 
security. Rabin’s security is far less, and would be unac- 
ceptable in many environments. 

In this paper, we describe a further modification to 
Rabin’s scheme that achieves improved computational 
performance, security and integrity. We achieve this by 
combining the All-Or-Nothing Transform (AONT) [24] 
with systematic Reed-Solomon erasure codes [13]. 
Hence, we call it AONT-RS. We describe the technique, 
evaluate it both theoretically and experimentally and de- 
tail how it fits into a commercial dispersed storage sys- 
tem. We conclude with some field data of actual deploy- 
ments. 


2 Dispersal Algorithms 


At the heart of all (k,n) threshold schemes (which we 
henceforth call dispersal algorithms) is a matrix-vector 
product, illustrated in Figure 1. The data to be stored is 
broken into words or elements which are w bits in length. 
A generator or dispersal matrix G is created, which has n 
rows and k columns. This matrix is multiplied by a k-element 
vector D (called the data or message) to yield 
an n-element vector C called the codeword. Each element 
of the codeword is stored on a different storage node. 


[Figure 1 diagram: the dispersal (generator) matrix G, with n rows and k columns, is multiplied by the data (message) vector D to give the codeword C.] 


Figure 1: The central matrix-vector product for all dis- 
persal algorithms. 


The dispersal matrix is constructed so that all combi- 
nations of k rows yield invertible matrices. This gives 
us a technique to reconstruct D from any k surviving ele- 
ments of the codeword: each row of G corresponds to the 
calculation of a codeword element. We create a new $k \times k$ 
matrix A from the rows of G that correspond to the k surviving 
elements. We invert A and multiply $A^{-1}$ by the 
surviving elements to yield D. The construction of G 
guarantees that A is invertible. 
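
The encode and reconstruct steps can be sketched in a few dozen lines of Python. This is a minimal illustration, not the implementation from any of the cited libraries: it assumes w = 8, the field polynomial 0x11d, and a Vandermonde matrix with evaluation points i+1, and all function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11d)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return p


def gf_pow(a, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r


def gf_inv(a):
    """Inverse of a nonzero byte: a^254, because a^255 = 1."""
    return gf_pow(a, 254)


def vandermonde(n, k):
    """n x k matrix with g[i][j] = (i+1)^j; distinct points keep any k rows invertible."""
    return [[gf_pow(i + 1, j) for j in range(k)] for i in range(n)]


def mat_vec(m, v):
    """Matrix-vector product; addition in GF(2^8) is XOR."""
    out = []
    for row in m:
        acc = 0
        for g, d in zip(row, v):
            acc ^= gf_mul(g, d)
        out.append(acc)
    return out


def invert(a):
    """Gauss-Jordan inversion of a square matrix over GF(2^8)."""
    k = len(a)
    aug = [row[:] + [int(i == j) for j in range(k)] for i, row in enumerate(a)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, x) for x in aug[col]]
        for r in range(k):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [x ^ gf_mul(f, y) for x, y in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


# A (3,5) example: encode three data bytes, then rebuild them from
# codeword elements 4, 1 and 3 only.
G = vandermonde(5, 3)
data = [0x41, 0x42, 0x43]
codeword = mat_vec(G, data)
keep = [4, 1, 3]
A = [G[i] for i in keep]
recovered = mat_vec(invert(A), [codeword[i] for i in keep])
print(recovered == data)
```

Bit-by-bit multiplication is chosen here for brevity; table-driven multiplication is the usual choice when performance matters.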

So that elements may fit into computer words, it is 
convenient that w be a power of two. To achieve this, 
we employ Galois Field arithmetic, $GF(2^w)$, where addition 
is equal to bitwise exclusive-or (XOR) and multiplication 
is implemented in a variety of ways either in 
hardware or software. In this way, dispersal is simply a 
variant of the well known Reed-Solomon codes [13, 22]. 
A tutorial on implementing Reed-Solomon codes in this 
manner is available in [17], and a thorough discussion of 
implementing Galois Field arithmetic is provided in [10]. 
There is also a methodology that converts multiplications 
into XORs described in [3]. There are open-source 
implementations of these codes and methodologies 
in [16, 20, 25, 36]. 

Shamir’s secret sharing algorithm encodes w bits of 
data in $d_0$. The remaining elements of D are randomly 
chosen w-bit words. The matrix G is a Vandermonde 
matrix, where $g_{i,j} = i^j$, which guarantees that any k rows 
are invertible so long as $n < 2^w$ [28]. Thus, when one 
uses Shamir’s algorithm on a b-byte file, the total stor- 
age requirement is nb bytes, and the act of encoding re- 
quires O(knb) XOR and multiplication operations (we 
will characterize this further in Section 6 below). The se- 
curity guarantees of Shamir’s algorithm are very strong 
— even with an infinite amount of computing power, un- 
less an attacker has possession of k words, he cannot de- 
termine anything about the initial data. Moreover, this is 
done without the necessity of storing encryption keys. 
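
To make the scheme concrete, the following sketch shares a single byte with a (k,n) threshold over $GF(2^8)$. It is an illustration under stated assumptions: w = 8, field polynomial 0x11d, share i evaluated at the point i+1, and recovery by Lagrange interpolation rather than explicit matrix inversion (the two are equivalent for a Vandermonde matrix). The function names are hypothetical.

```python
import secrets


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by 0x11d."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return p


def gf_inv(a):
    """Inverse of a nonzero byte, computed as a^254."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r


def make_shares(secret_byte, k, n):
    """Evaluate d_0 + r_1 x + ... + r_{k-1} x^(k-1) at x = 1..n."""
    coeffs = [secret_byte] + [secrets.randbelow(256) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):       # Horner's rule
            y = gf_mul(y, x) ^ c
        shares.append((x, y))
    return shares


def recover(shares):
    """Lagrange interpolation at x = 0 from any k shares."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = gf_mul(num, xj)        # (0 - xj) equals xj in GF(2^w)
                den = gf_mul(den, xi ^ xj)   # (xi - xj) equals xi XOR xj
        secret ^= gf_mul(yi, gf_mul(num, gf_inv(den)))
    return secret


shares = make_shares(0x5A, k=10, n=16)
print(recover(shares[3:13]) == 0x5A)
```

Each data byte produces n share bytes, which is where the nb storage cost comes from.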

Rabin’s information dispersal algorithm (IDA) weak- 
ens the security, but improves both storage efficiency and 
performance. Each element of D now contains a word of 
data. Thus the storage requirement is $nb/k$ bytes, improving 
both storage efficiency and encoding performance by 
a factor of k. Like Shamir, k elements of the codeword 
are required to reconstruct the original data. However, 
the security guarantees of Rabin are far less than Shamir. 
We will analyze this below in Section 5, but attackers 
looking for known or patterned data can find it more eas- 
ily from elements of the codeword. To combat this prob- 
lem, Rabin suggests a technique to generate the rows 
of G randomly, embed the row id’s within each code- 
word element, then encrypt the codewords [21]. Unfor- 
tunately, this requires storing an external encryption key, 
which does not solve the main problem we wish to solve 
(providing security without securely storing encryption 
keys). 

In 1993, Krawczyk proposed a blending of Rabin and 
Shamir, by encrypting the data with a key-based en- 
cryption algorithm, and then dispersing the encrypted 
data with an IDA and the key with a secret sharing 
scheme [12]. This is called Secret Sharing Made Short 
(SSMS). Our dispersal algorithm, described in the next 
section, also enriches Rabin’s IDA with security. Unlike 
SSMS, it does so without secret sharing, and with the 
integration of integrity checking for corruption. 


3 A New Dispersal Algorithm: AONT-RS 


We enrich Rabin’s IDA in two ways. First, we employ 
a variant of Rivest’s All-or-nothing Transform (AONT) 
as a preprocessing pass over the data [24]. The AONT 
may be viewed as an $(s+1, s+1)$ threshold scheme. Data 
composed of s words of size $w_a$¹ is encoded into $s+1$ 
different words so that none of the original words may 
be decoded unless all s+ 1 encoded words are present, or 
an attacker possesses enough computing power to crack 
an encryption key. The key, however, is encoded with 
the data. If a file’s size is b bytes, the performance of 
encoding is O(b). The benefits of the AONT are: 


• No external keys are necessary. 
• Very little extra storage is required. 
• The computational requirements of the attacker may 
  be a parameter of the encoding. 
• The performance is good. 


The AONT works as follows. The data is composed 
of s words $d_0, \ldots, d_{s-1}$, each of which is $w_a$ bits in length. 
A random key K is chosen, and each codeword $c_i$ is calculated as: 

$$c_i = d_i \oplus E(K, i+1),$$ 

where E is a key-based encryption algorithm such as 
AES [6]. A final codeword, $c_s$, is calculated to be a function 
of K and a hash of the other codewords. The AONT 
has computational security, which means that unless an 
attacker possesses all s+ 1 codewords or can guess K, the 
attacker cannot get information about any word or data. 
We will discuss this further in section 5 below. 

We modify this scheme slightly. We add an extra word 
of data $d_s$, called a canary [2]. This word has a known, 
fixed value, which allows us to check the integrity of the 
data when it is decoded. 

We generate $c_0, \ldots, c_s$ as described above and then calculate 
a hash h of the $s+1$ codewords using a standard 
hash algorithm such as SHA-256 [15] having an output 
at least as long as K. We then calculate a final block $c_{s+1}$ 
as: 

$$c_{s+1} = K \oplus h.$$ 


Our second modification of Rabin’s IDA is to employ 
a systematic erasure code instead of a non-systematic 
one. A systematic code is defined to be one where the 
codeword contains the original elements of D. Without 
loss of generality, the first k elements of C are equal to 
the elements of D: $c_i = d_i$ for $0 \le i < k$. This means that 
the first k rows of G compose a $k \times k$ identity matrix as 
pictured in Figure 2. 

Figure 2: A systematic erasure code. 


¹ Since AONT-RS mixes AONT with dispersal, we differentiate its 
word size from the dispersal’s word size using $w_a$ instead of w. 


Employing a systematic erasure code instead of a non-systematic 
one (as in both the Shamir and Rabin algorithms) 
improves performance because it eliminates the 
need to encode the first k codewords. Since many systems 
use values of k that are large relative to n (e.g. POTSHARDS’ 
evaluation uses a (3,5) Shamir scheme [30]) 
the savings during encoding with a systematic erasure 
code are substantial. Moreover, when decoding, code- 
word elements that are equal to data elements do not have 
to be decoded, which improves performance further. 

We call our dispersal technique AONT-RS, as it is a 
combination of the All-Or-Nothing Transform and Reed- 
Solomon coding. The intuition is that we use the AONT 
for security and the dispersal for availability, proximity 
and fault-tolerance. This is unlike Shamir, Rabin and 
SSMS which use dispersal to achieve both functions. 

[Figure 3 diagram: the data and canary are fed to a cipher, producing the encrypted data.] 





Figure 3: Encoding operation of AONT. 


Several diagrams depict the operation of AONT-RS 
and interaction between AONT and Reed-Solomon cod- 
ing. In Figure 3, data is processed by AONT. A canary 
is appended to the data, and the data and canary are en- 
crypted with a random key. A hash value of the encrypted 
data is computed. The hash value and random key are 
then combined via bitwise exclusive-or to form a differ- 
ence, which is appended to the encrypted data to form 
the AONT package. 

Figure 4: Dispersal of AONT package using a systematic 
IDA such as Reed-Solomon coding. 


Figure 5: Recovering the AONT package from a threshold 
number of slices. 


Once processed by AONT, the result is treated as normal 
input to a systematic IDA, as depicted in Figure 4. 
The IDA splits the input into k slices formed directly 
from the input and computes $n-k$ coding slices. Slices 
are then stored to separate locations. 

At a future time, slices may be retrieved and used to 
recover the data. The first step in this process requires 
obtaining a threshold number of slices, as in Figure 5. 
Short of a threshold number of slices the entire AONT 
package cannot be recovered; there is not enough infor- 
mation contained in m < k slices to yield the original in- 
put, whose length is k times the slice length. However, if 
one possesses any k of the slices, they may compute the 
original input to the IDA which in this case is the AONT 
package. 

As shown in Figure 6, reversing the AONT operation 
is trivial when one possesses the entire package. The 
first step is to compute the hash, h, of the encrypted data. 
Since the last block contains $K \oplus h$ and we know the hash 
value h, we may exclusive-or the last block with the hash 
to find $K \oplus h \oplus h$. Since $h \oplus h$ equals zero, the result 
is the random key K. The random key is then used to 
decrypt the encrypted data, and the canary is checked to 
detect corruption. 
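
The packaging and unpackaging steps can be sketched with the Python standard library alone. This is an illustration, not the deployed implementation. Because the standard library has no AES, a truncated HMAC-SHA-256 stands in for E(K, i+1); the 16-byte word size, the truncation of the hash to 16 bytes, the canary value and all function names are assumptions made for the sketch.

```python
import hashlib
import hmac
import secrets

WORD = 16                      # bytes per AONT word (assumed)
CANARY = b"\xa5" * WORD        # hypothetical fixed canary value


def prf(key, counter):
    """Stand-in for E(K, i+1); a real system would use a block cipher such as AES."""
    msg = counter.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:WORD]


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def aont_encode(data):
    """Return the AONT package: encrypted words, encrypted canary, then K XOR h."""
    if len(data) % WORD:
        raise ValueError("data length must be a multiple of the word size")
    words = [data[i:i + WORD] for i in range(0, len(data), WORD)] + [CANARY]
    key = secrets.token_bytes(WORD)
    body = b"".join(xor(w, prf(key, i + 1)) for i, w in enumerate(words))
    h = hashlib.sha256(body).digest()[:WORD]
    return body + xor(key, h)


def aont_decode(package):
    """Recover K from the last block, decrypt, and check the canary."""
    body, last = package[:-WORD], package[-WORD:]
    h = hashlib.sha256(body).digest()[:WORD]
    key = xor(last, h)                     # (K XOR h) XOR h = K
    words = [xor(body[i:i + WORD], prf(key, i // WORD + 1))
             for i in range(0, len(body), WORD)]
    if words[-1] != CANARY:
        raise ValueError("canary mismatch: package is corrupted")
    return b"".join(words[:-1])


block = bytes(range(256)) * 16             # a 4 KB block
package = aont_encode(block)
print(len(package), aont_decode(package) == block)
```

Flipping any bit of the encrypted data changes h, hence the recovered key, so the canary check fails with overwhelming probability.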


4 A Concrete Example 


To help illustrate, we present a concrete example. Suppose 
we have a 4KB block of data, D, that we wish to 
massage into 16 slices on 16 storage nodes so that we 
may reconstruct and verify the data so long as we possess 
any 10 slices. 


Figure 6: Restoring data from an AONT package. 

Shamir: To apply Shamir’s algorithm, we view the 
data as 4096 individual bytes, $d_0, \ldots, d_{4095}$. Each of the 
16 slices $S_0, \ldots, S_{15}$ will also be composed of 4096 individual 
bytes $s_{i,0}, \ldots, s_{i,4095}$ such that $s_{i,j}$ is a function 
of $d_j$ and nine random bytes. Specifically, 

$$s_{i,j} = d_j \oplus \sum_{x=1}^{9} (i+1)^x \, r_{j,x},$$ 

where $r_{j,x}$ is a random byte and arithmetic is 
over $GF(2^8)$. The total storage requirement is 64 KB. 

Rabin: To apply Rabin, we pad D to be 4100 bytes 
and then partition it into ten data slices $DS_0, \ldots, DS_9$ 
of 410 bytes each. As with Shamir, we view each 
data slice $DS_i$ to be composed of 410 individual 
bytes $ds_{i,0}, \ldots, ds_{i,409}$. We then calculate each of the 
16 slices using Reed-Solomon coding on the individual 
bytes:² 

$$s_{i,j} = \sum_{x=0}^{9} (i+1)^x \, ds_{x,j}.$$ 

Again, arithmetic is over $GF(2^8)$. The total storage requirement 
is 16*410 bytes = 6.41 KB. 

SSMS: With SSMS, we select a random 16-byte en- 
cryption key and encrypt the data with an encryption al- 
gorithm such as AES. We then disperse it using Rabin 
and disperse the key using Shamir. The total storage re- 
quirement is 16*(410+16) = 6.65 KB. 

AONT-RS: We will be adding 34 additional bytes to 
the data, and we will first view it as being composed 
of 257 16-byte words, $d_0, \ldots, d_{256}$, where the first 256 
words are the original data. We set $d_{256}$ to be a 16-byte 
canary value. We choose K to be sixteen random bytes 
and set each $c_i$ to equal $d_i \oplus E(K, i+1)$ where E is a 
standard encryption algorithm. Next we calculate h to 
be a 16-byte hash of $c_0, \ldots, c_{256}$. Finally, we set $c_{257}$ to 
equal $h \oplus K$. The last 2 bytes are immaterial: they are 
simply padding so that the data may be partitioned into 
ten equal slices. They could be used as additional ca- 
naries if desired. 


² While Rabin does not use a Vandermonde matrix in [21], the matrix 
he employs has the same properties. 


USENIX Association 


As with Rabin, we partition the 4130 bytes into ten 
data slices $DS_0, \ldots, DS_9$ of 413 bytes each. These will 
be stored on the first ten storage nodes. Six additional 
coding slices $CS_0, \ldots, CS_5$ will be calculated using a different 
dispersal matrix, such as the one depicted in Figure 7, 
which is derived from the Vandermonde matrix for 
systematic coding (see [18] for an explanation of why a 
Vandermonde matrix is inadequate for this purpose). The 
total storage requirement is 16*413 = 6.45 KB. 
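
The storage totals quoted for the four methods follow from simple arithmetic, sketched below in bytes. The function and parameter names are illustrative; the default values are the ones used in this example (a 4096-byte block, k = 10, n = 16, a 16-byte key and 16-byte words).

```python
import math


def storage_bytes(block=4096, k=10, n=16, key=16, word=16):
    """Total bytes stored across all n slices for each dispersal method."""
    rabin_slice = math.ceil(block / k)        # 410 bytes once D is padded
    aont_package = block + word + key         # data + canary + (K XOR h)
    aont_slice = math.ceil(aont_package / k)  # 413 bytes once padded
    return {
        "Shamir": n * block,                  # every slice is as large as D
        "Rabin": n * rabin_slice,
        "SSMS": n * (rabin_slice + key),      # Rabin slice plus a key share
        "AONT-RS": n * aont_slice,
    }


for name, size in storage_bytes().items():
    print(name, size, "bytes")
```

The three non-Shamir totals differ by only a few hundred bytes, so the choice between them rests on security and performance rather than space.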


  1   1   1   1   1   1   1   1   1 
147 138  73  93 161 103  58  99 178 
103 156 151 123 187 166 175 244  83 
 58 203  60  48 S51 175  52  16  30 
 93 151 205 212  44 123  48 197 244 
220 166 123  82 143 245  40 167 122 


Figure 7: Dispersal matrix for the systematic (10,16) 
Reed-Solomon code over $GF(2^8)$. 


In each of the four methods, a client or attacker needs 
to acquire 10 of the 16 slices to read the data. Each 
method has different security and performance characteristics, 
which are discussed in the Security and Performance 
sections below. 


5 Security Evaluation 


The threat model that we use is one where individual 
storage servers belong to different domains, both admin- 
istrative and physical. Servers may be lost due to non- 
security-related events like power failure or water dam- 
age, or their security may be compromised; for example a 
rogue system administrator or outside attacker can steal 
data. Moreover, servers may become corrupted either 
maliciously or due to the natural process of time. We 
assume that the physical dispersal of storage servers is 
limiting on an attacker, and that the difficulty of breach- 
ing servers in multiple domains, along with a judicious 
choice of k and n, is sufficient to make the system secure. 

All of these schemes provide a good level of security — 
if one cannot truly decode the data without acquiring all k 
slices, then an attacker without some a priori information 
about the data will not be able to glean anything from 
fewer than k slices. In the words of Rabin, “We do not 
see a way of fully reconstructing even small portions of D 
from $k-1$ pieces” [21].³ 

³ We have changed the variables in the quote to match our paper. 


However, if an attacker has some notion of what data 
he or she is seeking but possesses fewer than $k-1$ slices, 
then the schemes differ greatly. We will consider the 
most pathological example: An attacker possesses $m < k$ 
slices of the codeword C and wants to verify whether the 
data that it encodes matches some predetermined value. 
Further, if the attacker can verify that one slice of D 
matches, then the attacker can be assured that the rest 
matches. While this seems rather generous to the at- 
tacker, there are many realistic attacking scenarios that 
can be reduced to this one [7]. For each algorithm, we 
assume that the attacker knows how the slices were gen- 
erated, except for the random numbers. 

Shamir: Shamir’s security is guaranteed. Attackers
cannot get any information from fewer than k slices,
regardless of their computing power. For example,
with k - 1 slices each of size w, there are 2^w potential
values of d_0 that can generate those slices. Thus, every
possible value of d_0 is equally likely. One needs the k-th
slice to determine the actual value of d_0. This is
information theoretic security.
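
The counting argument can be illustrated with a small brute-force sketch. It assumes a toy prime field (p = 17) rather than the GF(2^8) arithmetic used in the paper, and the function names are hypothetical. For an attacker holding k - 1 slices, it counts how many polynomials are consistent with each candidate value of d_0.

```python
from itertools import product

P = 17  # toy prime modulus (assumption; the paper works over GF(2^8))

def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients (constant term first) mod P."""
    return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P

def count_polynomials_per_secret(known_slices, k):
    """For each candidate d_0, count degree-(k-1) polynomials matching the known slices."""
    counts = {d0: 0 for d0 in range(P)}
    for coeffs in product(range(P), repeat=k):
        if all(evaluate(coeffs, x) == y for x, y in known_slices):
            counts[coeffs[0]] += 1
    return counts

# A 3-of-n sharing of the secret d_0 = 5; the other coefficients stand in for random values.
secret_poly = (5, 11, 3)
two_slices = [(x, evaluate(secret_poly, x)) for x in (1, 2)]
three_slices = two_slices + [(3, evaluate(secret_poly, 3))]

counts_with_two = count_polynomials_per_secret(two_slices, k=3)
counts_with_three = count_polynomials_per_secret(three_slices, k=3)
```

With two slices every candidate secret is matched by exactly one polynomial, so no value is more likely than another; the third slice leaves only the true secret.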

Rabin: Since Rabin’s IDA has no randomness, it has
no security, even if the attacker owns just one slice. Since
the attacker knows how the slices are generated, compromise
consists solely of verifying that a slice has a
predetermined value. Further, if the generator matrix is
known and the data has recognizable patterns (i.e., it is
not random-looking), then it is possible to guess the content
of missing slices. If one has k - 1 slices, trying each
of the 2^w possibilities for words of a missing slice will
yield k recognizable words when the correct value is attempted.
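
A minimal sketch of the verification attack follows. It assumes a toy prime field (p = 257) and a hypothetical Vandermonde-style generator row instead of Rabin's actual GF(2^8) construction; the point is only that a deterministic, publicly known encoding lets an attacker re-encode a guess and compare it with a single stolen slice.

```python
P = 257  # toy prime modulus (assumption; Rabin's IDA in the paper uses GF(2^8))

def generator_row(i, k):
    """Row i of a public Vandermonde-style generator matrix (hypothetical choice)."""
    return [pow(i + 1, j, P) for j in range(k)]

def encode_slice(data, i, k):
    """Slice i: one dot product per group of k data words, with no randomness."""
    row = generator_row(i, k)
    return [sum(r * d for r, d in zip(row, data[g:g + k])) % P
            for g in range(0, len(data), k)]

def slice_matches_guess(stolen_slice, i, k, guess):
    """An attacker re-encodes a guessed document and compares it to one stolen slice."""
    return encode_slice(guess, i, k) == stolen_slice

k = 3
document = [72, 101, 108, 108, 111, 33]   # length is a multiple of k
stolen = encode_slice(document, 4, k)     # the attacker holds only slice 4
```

Calling `slice_matches_guess(stolen, 4, k, document)` confirms the guess without any other slice.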

SSMS: SSMS has computational security [12]. Without
the key, one has to break the encryption, which can
be made computationally intractable with a large enough
key. Moreover, since Shamir protects the key with
information theoretic security, there is no way to get the key
with fewer than k slices.

AONT-RS: AONT has the property that unless one
has all of the encrypted data, one cannot decode any
of it. This is because one needs all of the data to discover
K, and one cannot decode any of the data without
K. However, if an attacker owns K and one slice, then
the attacker can easily verify that D has a predetermined
value, just as in Rabin. Thus, we analyze the difficulty
in having the attacker figure out K’s value. Suppose the
attacker owns the first slice, which contains the first encoded
word of D, which is equal to d_0 ⊕ E(K, 1). The encoding
function guarantees that enumeration is the only
way to discover K’s value, which means that an attacker
must test up to 2^{w_a} potential values of K to discover its
real value. Like SSMS, this is computational security.

Thus, both AONT-RS and SSMS have computational
security. If an attacker owns any data slice, then compromise
can only occur by discovering K as above. If
an attacker owns a coding slice, then the attacker must
again enumerate potential values of K, calculate potential
values of the slice and verify them. Owning k - 1
slices adds no information; the act of verification still
boils down to enumerating all potential values of K. The
encryption and therefore missing words in other slices
cannot be guessed in the same way they can under Rabin.


Algorithm | Running time
Shamir    | Perf(n, k, kb)
Rabin     | Perf(n, k, b)
AONT-RS   | AONT(b) + Perf(n - k, k, b)

Table 1: Running time and storage requirements of the three dispersal algorithms.

Special mention must be made of storing K ⊕ h as
the last element of the codeword. Cryptographic hash
functions are designed to have an unpredictable and uniformly
distributed output. Further, they are designed to
follow the strict avalanche criterion [35], meaning h is
dependent on every bit of input. Therefore unless an attacker
knows all code words c_0, ..., c_s, h cannot be predicted.
Modeling the hash function as a random oracle, h
encrypts K in the same manner as a One-Time-Pad [34]
and provides information theoretic security since h is the
same length as K. Therefore K ⊕ h yields no information
about K when h is unknown.

Moreover, the avalanche criterion allows the canary to
be sufficient to check integrity. If any bit of the stored
slices is modified, then with sufficient probability, the
calculated hash h’ will be different from the one used
to calculate the difference. Since h’ differs from h, the
calculated encryption key K will be incorrect, and as a
result, the value in the calculated canary will differ from
its known value.
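
The roles of K ⊕ h and the canary can be sketched as follows. This is a simplified illustration, not the production implementation: the stand-in for E(K, i) is SHA-256(K || i) because the Python standard library has no block cipher, and the 32-byte word size, the all-zero canary value, and the function names are assumptions made for this example.

```python
import hashlib
import secrets

WORD = 32                    # word size in bytes (assumption for this sketch)
CANARY = b"\x00" * WORD      # fixed, publicly known canary word (assumption)

def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def keystream_word(key, index):
    """Stand-in for E(K, i): SHA-256(K || i). The paper uses a cipher such as AES-256."""
    return hashlib.sha256(key + index.to_bytes(8, "big")).digest()

def aont_encode(words, key=None):
    """Mask each WORD-byte data word plus the canary; the final output word is K xor h."""
    key = key if key is not None else secrets.token_bytes(WORD)
    coded = [xor(w, keystream_word(key, i + 1)) for i, w in enumerate(words + [CANARY])]
    h = hashlib.sha256(b"".join(coded)).digest()
    return coded + [xor(key, h)]

def aont_decode(package):
    """Recover K from the hash of all coded words, then unmask and check the canary."""
    *coded, masked_key = package
    key = xor(masked_key, hashlib.sha256(b"".join(coded)).digest())
    words = [xor(c, keystream_word(key, i + 1)) for i, c in enumerate(coded)]
    if words[-1] != CANARY:
        raise ValueError("canary mismatch: package was modified or is incomplete")
    return words[:-1]
```

Flipping a single bit of any coded word changes the hash, which yields a wrong key and therefore a wrong canary on decode.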

While computational security is not as strong as
information theoretic security, in our view it is functionally
equivalent. As long as w_a is sufficiently large, it is
computationally infeasible for an attacker to even verify that
slices hold given data. For example, when w_a = 256 as in
Section 4, compromise requires the enumeration of 2^256
keys. To put this in perspective, if each person on earth
had access to a trillion computers that can test a trillion
keys per second, it would take over 10^35 years on average
to correctly guess the key. According to some estimates
of proton half-life, most matter in the universe will have
decayed before the key would be found [1].
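
The scale of this estimate can be checked with back-of-the-envelope arithmetic. The sketch below assumes a world population of roughly seven billion (a figure not given in the text) and that, on average, half of the key space must be searched.

```python
WORLD_POPULATION = 7e9          # assumption: roughly seven billion people
COMPUTERS_PER_PERSON = 1e12     # "a trillion computers" per person
KEYS_PER_SECOND = 1e12          # "a trillion keys per second" per computer
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def expected_years_to_find_key(key_bits):
    """Expected search time when, on average, half of the key space is enumerated."""
    expected_trials = 2 ** key_bits / 2
    rate = WORLD_POPULATION * COMPUTERS_PER_PERSON * KEYS_PER_SECOND
    return expected_trials / rate / SECONDS_PER_YEAR

years = expected_years_to_find_key(256)
```

Under these assumptions the result lies between 10^35 and 10^36 years.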


FAST ’11: 9th USENIX Conference on File and Storage Technologies


6 Theoretical Performance


Let Perf(R,C,S) be the CPU time that it takes to encode
D, composed of S total bytes, with an R x C dispersal
matrix. In terms of big-O notation, Perf(R,C,S) =
O(RCS). A more precise evaluation of Perf(R,C,S) is
difficult, because of the variety of ways that the encoding
may be implemented. If one implements the encoding
with standard finite field arithmetic, then:


$$\mathrm{Perf}(R,C,S) = \frac{(R-1)(C-1)\,\frac{S}{C}}{\mathit{Mult}} + \frac{R\,(C-1)\,\frac{S}{C}}{\mathit{XOR}}$$


where Mult is the bandwidth of performing Galois Field
multiplication and XOR is the bandwidth of performing
XOR operations. This is because encoding becomes
a series of dot products to create R coding slices, each
of whose size is S/C bytes. The difference in the number
of multiplications vs. XORs arises because nearly
all dispersal matrices are like Figure 7 and have ones in
their top rows and leftmost columns. Implementations
of Reed-Solomon coding do, however, differ in their
performance characteristics. Using Cauchy Reed-Solomon
coding [3], for example, substitutes additional XOR
operations for the multiplication and can improve
performance significantly [19].
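
The reason multiplications and XORs are counted separately can be seen by tallying the operations needed per word of output. The sketch below uses a hypothetical R x C matrix whose non-unit entries are placeholders rather than real Galois Field elements; only the positions of the ones matter for the count.

```python
def count_operations(matrix):
    """Count multiplications and XORs needed per output word across all dot products."""
    # A coefficient of 1 needs no multiplication; a coefficient of 0 needs no work at all.
    mults = sum(1 for row in matrix for entry in row if entry not in (0, 1))
    # Combining m nonzero terms in a row takes m - 1 XORs.
    xors = sum(sum(1 for entry in row if entry != 0) - 1 for row in matrix)
    return mults, xors

def toy_dispersal_matrix(rows, cols):
    """Hypothetical R x C matrix with ones in the top row and leftmost column."""
    return [[1 if r == 0 or c == 0 else 2 + r * cols + c for c in range(cols)]
            for r in range(rows)]

mults, xors = count_operations(toy_dispersal_matrix(6, 10))
```

For R = 6 and C = 10 this gives (R - 1)(C - 1) = 45 multiplications and R(C - 1) = 54 XORs per word.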

Additionally, let AONT(S) be the time that it takes
to perform the AONT on S bytes of data. The choice
of w_a, encryption and hashing technique will all affect
AONT(S). In general, though, it is O(S) and is also
easy to parallelize [24].

Given the parameters k, n, b, Perf(R,C,S), and 
AONT(S) the performance of the three main dispersal 
algorithms and their storage requirements are given in 
Table 1. Since SSMS doesn’t specify a recommended 
dispersal or encryption algorithm, we omit it from the 
remaining analyses. Roughly, its performance will be 
close to AONT-RS. 


7 Microbenchmark Performance 


Figure 8: Performance comparison of the dispersal algorithms. Each graph fixes the k-to-n rate and plots speed of
encoding with each dispersal algorithm.

To assess actual performance, we used open-source C
libraries to perform the various functionalities. All tests
were performed on a 4-core Intel Xeon W3530 at 2.80
GHz with 6 GB of memory at 1066 MHz running Linux
kernel 2.6.32. Despite having multiple cores, all benchmarks
were performed using a single thread. For
Reed-Solomon coding, we used Luigi Rizzo’s open source
library over GF(2^8) [25]. We tested a variety of k-of-n
configurations, ranging from 3-of-6 to 32-of-64, measuring
c_e, defined as the bandwidth of creating each coding
slice, times k. For a given machine, c_e should be relatively
constant, since the time to create each coding slice
should be linear in k. Despite the wide disparity in
configurations, we observe that c_e is fairly consistent, ranging
from a minimum of 921.60 MB/s in the 3-of-6 configuration
to a maximum of 994.00 MB/s in 27-of-54. The average
performance for the 30 configurations tested is 965.61
MB/s with a standard deviation of 11.42 MB/s. Thus, we
can use c_e to approximate Perf as:


$$\mathrm{Perf}(R,C,S) = \frac{R \cdot C \cdot \frac{S}{C}}{965.61\ \mathrm{MB/s}}$$

The encoding time for AONT is dependent on the 
choice of cipher and hash function. To encode S bytes 
using AONT, both the cipher and hash function must pro- 
cess S bytes. Therefore the time equals the sum of the 
time to encrypt S bytes plus the time to hash S bytes. 
We tested the performance of two pairs of cipher/hash
algorithms, one tailored for high security (AES-256 and
SHA-256) and the other tailored for performance (RC4-128
and MD5). For this test, we used OpenSSL 0.9.8k
with a block size of 8 KB. The results are in Table 2.


Algorithm | Encoding Rate (MB/s)
AES-256   | 143.30
RC4-128   | 414.17
SHA-256   | 160.03
MD5       | 559.47

Table 2: Performance of two encryption algorithms
(AES-256 and RC4-128) and two hash algorithms (SHA-256
and MD5).


Thus, we come up with two functions for AONT(S),
one which we call secure (AES-256 and SHA-256), and
one which we call fast (RC4-128 and MD5):

$$\mathrm{AONT}_{secure}(S) = \frac{S}{75.60\ \mathrm{MB/s}}$$
$$\mathrm{AONT}_{fast}(S) = \frac{S}{237.99\ \mathrm{MB/s}}$$
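
These two rates follow from the Table 2 measurements: because every byte is both encrypted and hashed, the per-byte times add, so the combined rate is the reciprocal of the sum of reciprocals. A short sketch (the dictionary and function names are illustrative, and the unlabeled 559.47 MB/s entry is taken to be MD5):

```python
RATES_MB_S = {"AES-256": 143.30, "SHA-256": 160.03, "RC4-128": 414.17, "MD5": 559.47}

def aont_rate(cipher, hash_fn):
    """Combined rate when every byte is processed by both the cipher and the hash."""
    return 1.0 / (1.0 / RATES_MB_S[cipher] + 1.0 / RATES_MB_S[hash_fn])

secure = aont_rate("AES-256", "SHA-256")
fast = aont_rate("RC4-128", "MD5")
```

Rounded to two decimals, the results are 75.60 MB/s and 237.99 MB/s, matching the denominators above.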


We now have the necessary information to use Table 1
to evaluate the performance of the three dispersal
algorithms for any k-of-n configuration. We do so in Figure 8.
Each graph fixes a k-of-n ratio, called a rate, and then
plots the speed of encoding in MB of data per second.
The rates increase by 1/6 for each successive graph, starting
with a very low rate of 1/6 and proceeding to a very
high rate of 5/6.

The trade-offs of the various formulas are apparent
from the graph. There is a dispersal cost for all three
algorithms and an AONT cost for the AONT-RS algorithms.
The AONT cost is constant, since it depends
solely on the size of the data. Thus, when dispersal
is very fast, as in the 1-of-6 and 2-of-6 cases, Rabin
outperforms AONT-RS_fast and Shamir outperforms
AONT-RS_secure. As k and n grow, however, the dispersal
costs increase. This increase is most pronounced with
Shamir, then with Rabin and finally with AONT-RS. For
each rate except the very low 1/6, there is a point where
the performance of AONT-RS_fast becomes the best, and
a point where AONT-RS_secure’s performance surpasses
both Shamir and Rabin. These points come at lower
values of n for higher k-of-n rates.

A schematic of Cleversafe’s storage architecture is
depicted in Figure 9. Although not plotted above, of special
interest is the 3-of-5 data point, since this is the k-of-n
configuration measured by POTSHARDS [30], an
archival storage system that uses Shamir for both
fault-tolerance and security. For this configuration, the
performance of AONT-RS_secure (65.4 MB/s) is nearly identical
to Shamir (64.4 MB/s), which means that a system like
POTSHARDS can achieve computational security rather
than information theoretic security for the same
performance, but with a factor of three less storage.

Figure 9: A high-level picture of Cleversafe’s storage architecture.
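
The two 3-of-5 figures can be reproduced from Table 1 and the microbenchmarks. The sketch below assumes that producing each coded slice from S MB of input takes S / c_e seconds, so that Perf(R, C, S) = R * S / c_e; this reading is an assumption, chosen because it reproduces the quoted numbers, and the function names are illustrative.

```python
C_E = 965.61           # MB/s, average measured in the microbenchmark
AONT_SECURE = 75.60    # MB/s, AES-256 combined with SHA-256

def perf_time(rows, size_mb):
    """Seconds to produce `rows` coded slices from `size_mb` MB of input (assumed model)."""
    return rows * size_mb / C_E

def shamir_rate(n, k):
    """Table 1: Perf(n, k, kb). Shamir encodes k times as much input as the data itself."""
    return 1.0 / perf_time(n, k * 1.0)

def aont_rs_rate(n, k, aont_mb_s):
    """Table 1: AONT(b) + Perf(n - k, k, b). Only the n - k coding slices are computed."""
    return 1.0 / (1.0 / aont_mb_s + perf_time(n - k, 1.0))

shamir_3_of_5 = shamir_rate(5, 3)
aont_rs_3_of_5 = aont_rs_rate(5, 3, AONT_SECURE)
```

Rounded to one decimal, these evaluate to 64.4 MB/s for Shamir and 65.4 MB/s for AONT-RS_secure.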


8 Commercial Dispersed Storage System 


AONT-RS is a feature in the storage software and appli- 
ances sold by Cleversafe, which developed the technique 
to address the threat model of compromise, theft or loss 
of disks and devices. By appropriately tuning the disper- 
sal configuration, all disks or devices at an entire site can 
be stolen and the data will remain confidential. Similarly,
as long as a minimum threshold of servers is available,
subsets of servers may be brought offline temporarily for
maintenance, or permanently for replacement. Because the
servers are protected by AONT-RS, storage owners may
dispose of servers without having to “wipe” the drives
clean: the information on the servers is impossible
to obtain without gaining access to some subset of the
remaining servers.

Two paradigms are exposed to clients — a block 
paradigm that supports standard protocols like NFS, 
CIFS, FTP and iSCSI, and an object paradigm that sup- 
ports larger storage units for better performance. An 
Accesser calculates mappings that associate blocks or 
objects to slices on dispersed storage servers (termed 
“Slicestores” in Cleversafe’s product). A common con- 
figuration is to encode each block or object into 16 slices 
using a (10, 16)-threshold AONT-RS scheme. 

Block reads and writes that use iSCSI go through the
Accesser. The Accesser performs the block-to-slice
encoding and decoding, and also manages the traffic to and
from the servers. The other protocols require a Gateway,
typically co-located with the Accesser, that translates
between the various file protocols and iSCSI. Since
this path has two hops and interacts with the servers
with small messages, the performance of the block
protocols is limited by the networking hardware and not the
AONT-RS protocol. Storage servers do support multiple
Accessers, which relieves one bottleneck of the
block-based system.


To achieve better performance, Cleversafe also exports
a protocol for large objects. Objects are partitioned into
megabyte-sized chunks, which are then encoded into
slices for dispersal. Clients may either read and write
objects through the Accesser using HTTP, or they may use an
SDK to perform their own AONT-RS encoding/decoding
so that they may interact directly with the servers. In
both cases, the client manages the context of the object
name. A common software architecture is that clients
use a database to maintain the meaning and relationships
of the content, and they store the object names in a
column of the database.

Slice pointers are 48 bytes in length and are composed
of three parts: routing information that enables
slices to be routed to and from the correct servers, the
source name which identifies the slice, and vault information
which enables access control. The source name
is opaque; its interpretation is dependent on the specific
client and server. Vaults are logical containers of storage.
Each vault has its own quotas, data coding parameters
and access controls. Access controls are identity-based;
each vault may have an arbitrary number of accounts
granted read or write permissions to it.


Each slice is stored with metadata that identifies the
slice’s coding parameters and a version number. The version
number is increased for each distinct write of the
block or object, and concurrency control is maintained
via the SDK with transactions and a three-phase commit.
An additional parameter of each system is the write
threshold, z, where k ≤ z ≤ n. This specifies how many
slices must be written before a write can be committed.
Setting z closer to k improves latency at the expense of
reliability for a window of time. The remaining (n - z)
writes are processed in the background, which reduces
this window of exposure.

Figure 10: Actual and projected performance of dispersed storage of 10 MB objects on a (5,8) test configuration.


Authentication in the system is two-way: servers
authenticate themselves to clients by means of a digital
certificate, which identifies each server within the dispersed
storage system and allows TLS sessions to be created. The
method of authentication of the client to the server is
flexible: both password and certificate-based authentication
are supported. Despite use of AONT-RS, secure
network communication is still required for security
since a threshold number of slices travel together over
the ‘last mile’ of the client’s connection.

All components are written in Java. Reed-Solomon 
erasure coding is performed using Java’s FEC li- 
brary [16], and encryption using SunJCE. 


9 Measured Performance 


To measure performance, we use a commercial config- 
uration with one or two clients and eight servers. The 
client and Accesser machines each have a 4-core Intel(R) 
Xeon(R) X3430 processor running at 2.40 GHz with 8 
MB cache and 16 GB of ECC RAM. Four GB of memory 
is allocated to the JVM when executing the software. We 
use the Java HotSpot(TM) 64-Bit Server VM (build
17.0-b16, mixed mode) running Java 1.6.0_21. The storage
servers each have a 4-core Intel(R) Xeon(R) X3460 pro- 
cessor at 2.80 GHz with 8 MB cache and 16 GB of ECC 
RAM. For storage, each server has twelve 2 TB Seagate 
SATA drives. The networking between components con- 
sists of a 10 Gb Ethernet switch. To handle simultaneous 
connections to multiple servers, the Accessers have 10 
Gb network interface cards. The servers’ cards are 1 Gb. 

Our main test has the client spend 10 minutes reading
and writing 10 MB objects, held in main memory, to the
eight-server storage network, using the SDK and object
interface. The coding parameters are k = 5 and n = 8, and
five threads are employed by the client to leverage all of
its cores. As in Section 7, we recorded microbenchmarks
of the various components of dispersal:

$$\mathrm{AONT}_{secure}(S) = \frac{S}{104.77\ \mathrm{MB/s}}$$
$$\mathrm{AONT}_{fast}(S) = \frac{S}{249.03\ \mathrm{MB/s}}$$
$$c_e = 2628\ \mathrm{MB/s}$$


The performance of a control and the dispersal algo- 
rithms is shown in Figure 10(a). The control has the 
client perform no encoding, but still sends 8 slices to the 
servers. While the Cleversafe implementation is flexible, 
allowing us to embed Rabin and both AONT-RS disper- 
sal algorithms, we did not implement Shamir within the 
framework. This is because the blowup of storage re- 
quirements by a factor of five would be unreasonable. 

We show the actual performance of writes and reads 
for the control, the two AONT-RS implementations and 
Rabin. We also include the projected write performance 
of the dispersal algorithms, including Shamir, using the 
performance equations from section 7, the microbench- 
marks, plus the performance of the control as the actual 
dispersal bandwidth (214 MB/s). 

For the three dispersal algorithms that we tested, the 
projected performance was within ten percent of the ac- 
tual performance. We find this result compelling be- 
cause the system on which the tests were performed was 
a production-level system, implementing the full func- 
tionality of Cleversafe’s commercial storage system, in- 
cluding access control and metadata management. 

In the tests with coding, the CPU utilization of the
client is measured to be 90%. Since the closest I/O
bottlenecks are the eight 1-Gbps links to the storage servers,
it is clear that the limiting factor in these tests is the
ability of the client computer to process data. To further
affirm the client as bottleneck, we ran two clients
simultaneously and present their performance in
Figure 10(b). The clients’ performance is nearly identical
to Figure 10(a).

It is worth noting that AONT-RS_secure exhibits worse
performance when reading than when writing; we expected
that during reads, fewer CPU resources would be
required, since some slices do not need to be processed
by the IDA. The worse performance is due to the SunJCE
implementation of AES, which is significantly slower
when decrypting than when encrypting. In a stand-alone
benchmark we observed 31.51 MB/s when decrypting vs.
44.77 MB/s when encrypting.


10 Tales of Deployment 


Today, there are over 20 Cleversafe dispersed storage 
installations in pilot and production around the world, 
with customers drawing from a diverse set of industries 
including financial, health care, entertainment, and de- 
fense. Several customers (who have asked to remain 
anonymous) have cited one important factor in their 
purchasing decision: that the contents of small sets of 
servers are meaningless in isolation. Thus, one can
decommission disk drives or potentially even server sites
without having to “wipe” them, which can be expensive.⁴
Since nearly all U.S. states have “data breach
laws” that require companies to proactively disclose the
loss of storage that is not encrypted [33], using AONT-RS
can save companies the time, attorney fees and bad
publicity that result from having to alert consumers to a data
breach.

One of Cleversafe’s deployments is for The Museum
of Broadcast Communications, which serves its video
collections on the Internet. In particular, over 8,500 hours
of historical audio and video content have been digitized 
and stored on tens of terabytes in one of Cleversafe’s dis- 
persed storage systems. Roughly 200,000 monthly visi- 
tors access the archives over the web. 

The Museum deployment is composed of 16 storage 
servers, each having 4 TB of raw capacity and spread 
across 8 sites: Chicago, Dallas (two locations), Denver,
New Jersey, San Francisco, Seattle and Tampa. The sites 
are situated across three power grids in the continental 
United States, and the data is dispersed in a 10-of-16 
configuration. In this way, even if one entire power grid 
shuts down, enough servers will remain accessible to re- 
trieve all the data. The Museum uses the object store 
interface inside its internal database, so that users em- 
ploy the database to search a rich set of metadata about 
the movies, which can then be retrieved using the object 
handle. 
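
The power-grid claim reduces to a simple check: after any one grid fails, at least k = 10 of the 16 servers must remain reachable. The sketch below uses a hypothetical assignment of servers to grids, since the actual placement is not given in the text.

```python
def survives_any_single_grid_loss(servers_per_grid, k):
    """True if at least k servers stay reachable after any one power grid fails."""
    total = sum(servers_per_grid)
    return all(total - lost >= k for lost in servers_per_grid)

# Hypothetical placement of the 16 servers over three grids (not taken from the paper).
placement = [6, 5, 5]
tolerates_grid_failure = survives_any_single_grid_loss(placement, k=10)
```

With a 10-of-16 configuration, the check passes only if no single grid hosts more than six servers.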


⁴For example, see http://www.east-tec.com/enterprise/disposesecureent/.

Internally, Cleversafe maintains dispersed storage systems
having over 1 PB of capacity. These are used
for development, testing and storing production
data. Employees have their own personal vaults with access
to a 30 TB pool of dispersed storage, which is implemented
over 8 geographically separated storage servers
across the United States.

In one case, Cleversafe initially deployed a system 
across four sites, but at a later time decided that it should 
be migrated to 8 sites to provide better tolerance to site 
and power grid outages. To accomplish this without 
bringing the system down, machines were incrementally 
boxed up and shipped across the country, such that at 
all times a threshold number remained online. There- 
fore the system remained accessible for reads and writes 
throughout the process. The same essential technique is 
now used to apply software updates. Nodes are upgraded 
individually allowing the system to maintain availability 
throughout the upgrade process. 


11 Conclusion 


Dispersed storage systems enable availability, scalabil- 
ity, and performance based on physical proximity. They 
also enable security via (k,n) threshold schemes that re- 
quire attackers to authenticate themselves to k of n stor- 
age nodes in order to read data. The threshold schemes 
provide this security without relying on the secure stor- 
age of encryption keys, which is a notoriously difficult 
problem. 

We have described a new dispersal algorithm called 
AONT-RS, which combines the All-Or-Nothing Trans- 
form with systematic Reed-Solomon codes to achieve 
computational security. Compared to traditional ap- 
proaches to dispersal, AONT-RS has a very attractive 
blend of properties. Its storage and computational footprint
is much smaller than that of Shamir secret sharing. While
Shamir achieves information theoretic security, AONT-RS’s
security can be tuned so that compromise is computationally
infeasible. Compared to Rabin’s classic dispersal
algorithm, AONT-RS achieves a far greater degree
of security, and also better performance for larger instal- 
lations. This is because AONT-RS is based on a sys- 
tematic Reed-Solomon erasure code rather than the non- 
systematic code employed by Rabin. We have detailed 
the theoretical and applied performance of the dispersal 
algorithms, and described a commercial dispersed stor- 
age product that is based upon the dispersal algorithm. 

AONT-RS is not specific to our dispersal solution. For
example, the POTSHARDS archival storage system [30]
could use AONT-RS to implement computational rather
than information theoretic security and reduce its storage
requirements by a factor of three. Other solutions
such as Gridsharing [31] can improve their security by
employing AONT-RS rather than a standard systematic
Reed-Solomon code.

In future work, we would like to collect data from 
our private and commercial deployments concerning fail- 
ures, node availability, compromise and attack. Such 
data will enable us to make better policy decisions con- 
cerning configurations of dispersed storage. These deci- 
sions will allow us to tune the AONT and erasure code 
configuration used, and will also allow us to make the 
most efficient use of our storage.


Abstract

Benchmarking file and storage systems on large file-system
images is important, but difficult and often infeasible.
Typically, running benchmarks on such large
disk setups is a frequent source of frustration for
file-system evaluators; the scale alone acts as a strong deterrent
against using larger albeit realistic benchmarks. To
address this problem, we develop David: a system that
makes it practical to run large benchmarks using the modest
amounts of storage or memory capacity readily available
on most computers. David creates a “compressed” version
of the original file-system image by omitting all file
data and laying out metadata more efficiently; an online
storage model determines the runtime of the benchmark
workload on the original uncompressed image. David
works under any file system, as demonstrated in this paper
with ext3 and btrfs. We find that David reduces storage
requirements by orders of magnitude; David is able
to emulate a 1 TB target workload using only an 80 GB
available disk, while still modeling the actual runtime
accurately. David can also emulate newer or faster devices;
e.g., we show how David can effectively emulate a
multi-disk RAID using a limited amount of memory.


1 Introduction 


File and storage systems are currently difficult to bench- 
mark. Ideally, one would like to use a benchmark work- 
load that is a realistic approximation of a known appli- 
cation. One would also like to run it in a configuration 
representative of real world scenarios, including realistic 
disk subsystems and file-system images. 

In practice, realistic benchmarks and their realistic 
configurations tend to be much larger and more com- 
plex to set up than their trivial counterparts. File system 
traces (e.g., from HP Labs [17]) are good examples of 
such workloads, often being large and unwieldy. Devel- 
oping scalable yet practical benchmarks has long been 
a challenge for the storage systems community [16]. In 
particular, benchmarks such as GraySort [1] and SPEC- 
mail2009 [22] are compelling yet difficult to set up and 
use currently, requiring around 100 TB for GraySort and 
anywhere from 100 GB to 2 TB for SPECmail2009.

Benchmarking on large storage devices is thus a frequent
source of frustration for file-system evaluators; the
scale acts as a deterrent against using larger albeit realistic
benchmarks [24], but running toy workloads on small
disks is not sufficient. One obvious solution is to continually
upgrade one’s storage capacity. However, this is
expensive, and it may be infeasible to justify
the costs and overheads solely for benchmarking.

Storage emulators such as Memulator [10] prove ex- 
tremely useful for such scenarios — they let us prototype 
the “future” by pretending to plug in bigger, faster stor- 
age systems and run real workloads against them. Mem- 
ulator, in fact, makes a strong case for storage emulation 
as the performance evaluation methodology of choice. 
But emulators are particularly tough: if they are to be 
big, they have to use existing storage (and thus are slow); 
if they are to be fast, they have to be run out of memory 
(and thus they are small). 

The challenge we face is how to get the best of
both worlds. To address this problem, we have developed
David, a “scale down” emulator that allows one
to run large workloads by scaling down the storage re- 
quirements transparently to the workload. David makes 
it practical to experiment with benchmarks that were oth- 
erwise infeasible to run on a given system. 

Our observation is that in many cases, the benchmark 
application does not care about the contents of individ- 
ual files, but only about the structure and properties of 
the metadata that is being stored on disk. In particular, 
for the purposes of benchmarking, many applications do 
not write or read file contents at all (e.g., fsck); the ones 
that do, often do not care what the contents are as long as 
some valid content is made available (e.g., backup soft- 
ware). Since file data constitutes a significant fraction 
of the total file system size, ranging anywhere from 90 
to 99% depending on the actual file-system image [3],
avoiding the need to store file data has the potential to 
significantly reduce the required storage capacity during 
benchmarking. 

The key idea in David is to create a “compressed” version
of the original file-system image for the purposes of
benchmarking. In the compressed image, unneeded user
data blocks are omitted using novel classification techniques
to distinguish data from metadata at scale; file
system metadata blocks (e.g., inodes, directories and indirect
blocks) are stored compactly on the available backing
store. The primary benefit of the compressed image
is to reduce the storage capacity required to run any given
workload. To ensure that applications remain unaware of
this interposition, whenever necessary, David synthetically
generates file data on the fly; metadata I/O is redirected
and accessed appropriately. David works under
any file system; we demonstrate this using ext3 [25] and
btrfs [26], two file systems very different in design.

Since David alters the original I/O patterns, it needs 
to model the runtime of the benchmark workload on the 
original uncompressed image. David uses an in-kernel 
model of the disk and storage stack to determine the 
run times of all individual requests as they would have 
executed on the uncompressed image. The model pays 
special attention to accurately modeling the I/O request 
queues; we find that modeling the request queues is cru- 
cial for overall accuracy, especially for applications issu- 
ing bursty I/O. 

The primary mode of operation of David is the timing- 
accurate mode in which after modeling the runtime, an 
appropriate delay is inserted before returning to the ap- 
plication. A secondary speedup mode is also available 
in which the storage model returns instantaneously after 
computing the time taken to run the benchmark on the 
uncompressed disk; in this mode David offers the potential
to reduce application runtime and speed up the benchmark
itself. In this paper we discuss and evaluate David
in the timing-accurate mode. 

David allows one to run benchmark workloads that re- 
quire file-system images orders of magnitude larger than 
the available backing store while still reporting the run- 
time as it would have taken on the original image. We 
demonstrate that David even enables emulation of faster 
and multi-disk systems like RAID using a small amount 
of memory. David can also aid in running large bench- 
marks on storage devices that are expensive or not even 
available in the market as it requires only a model of the 
non-existent storage device; for example, one can use a 
modified version of David to run benchmarks on a hypo- 
thetical 1TB SSD. 

We believe David will be useful to file and storage
developers, application developers, and users looking to
benchmark these systems at scale. Developers often like
to evaluate a prototype implementation at larger scales
to identify performance bottlenecks, fine-tune optimizations,
and make design decisions; analyses at scale often
reveal interesting and critical insights into the system [16].
David can help obtain approximate performance
estimates within limits of its modeling error. For
example, how does one measure performance of a file
system on a multi-disk multi-TB mirrored RAID configuration
without having access to one? An end-user
looking to select an application that works best at larger
scale may also use David for emulation. For example,
which anti-virus application scans a terabyte file system
the fastest?

Figure 1: Capacity Savings. Shows the savings in storage
capacity if only metadata is stored, with varying file-size
distribution modeled by (μ, σ) parameters of a lognormal
distribution, (7.53, 2.48) and (8.33, 3.18) for the two extremes.


One challenge in building David is how to deal with 
scale as we experiment with larger file systems containing
many more files and directories. Figure 1 shows the
percentage of storage space occupied by metadata alone 
as compared to the total size of the file-system image 
written; the different file-system images for this experi- 
ment were generated by varying the file size distribution 
using Impressions [2]. Using publicly available data on 
file-system metadata [4], we analyzed how file-size dis- 
tribution changes with file systems of varying sizes. 


We found that larger file systems not only had more 
files, they also had larger files. For this experiment, 
the parameters of the lognormal distribution controlling 
the file sizes were changed along the x-axis to gen- 
erate progressively larger file systems with larger files 
therein. The relatively small fraction belonging to metadata
(roughly 1 to 10%) as shown on the y-axis demonstrates
the potential savings in storage capacity made
possible if only metadata blocks are stored; David is de- 
signed to take advantage of this observation. 
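
To give a feel for what the two (μ, σ) extremes in Figure 1 imply, the following sketch computes the median and mean of a lognormal distribution. It assumes, without confirmation from the text, that the parameters apply to the natural logarithm of the file size in bytes; the function name is hypothetical.

```python
import math

def lognormal_summary(mu, sigma):
    """Median and mean of a lognormal distribution with parameters (mu, sigma)."""
    median = math.exp(mu)                      # exp(mu)
    mean = math.exp(mu + sigma ** 2 / 2.0)     # exp(mu + sigma^2 / 2)
    return median, mean

# Assumption: sizes are in bytes and logs are natural logs.
for mu, sigma in [(7.53, 2.48), (8.33, 3.18)]:
    median, mean = lognormal_summary(mu, sigma)
    print(f"mu={mu}, sigma={sigma}: median ~{median:,.0f} B, mean ~{mean:,.0f} B")
```

Because the mean grows with exp(σ²/2), a larger σ shifts far more bytes into a few large files, which is consistent with the metadata share shrinking as file systems grow.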


For workloads like PostMark, mkfs, Filebench 
WebServer, Filebench VarMail, and other mi- 
crobenchmarks, we find that David delivers on its 
promise in reducing the required storage size while still 
accurately predicting the benchmark runtime for both 
ext3 and btrfs. The storage model within David is fairly 
accurate in spite of operating in real-time within the ker- 
nel, and for most workloads predicts a runtime within 
5% of the actual runtime. For example, for the Filebench 
webserver workload, David provides a 1000-fold reduc- 
tion in required storage capacity and predicts a runtime 
within 0.08% of the actual.

Figure 2: Metadata Remapping and Data Squashing
in David. The figure shows how metadata gets remapped and
data blocks are squashed. The disk image above David is the
target and the one below it is the available.


2 David Overview 


2.1 Design Goals for David 


• Scalability: Emulating a large device requires David
  to maintain additional data structures and mimic several
  operations; our goal is to ensure that it works well
  as the underlying storage capacity scales.

• Model accuracy: An important goal is to model
  a storage device and accurately predict performance.
  The model should not only characterize the physical
  characteristics of the drive but also the interactions
  under different workload patterns.

• Model overhead: Equally important to being accurate
  is that the model imposes minimal overhead; since
  the model is inside the OS and runs concurrently with
  workload execution, it is required to be fairly fast.

• Emulation flexibility: David should be able to emulate
  different disks, storage subsystems, and multi-disk
  systems through appropriate use of backing stores.

• Minimal application modification: It should allow
  applications to run unmodified without knowing the
  significantly smaller capacity of the storage system
  underneath; modifications can be performed in limited cases
  only to improve ease of use but never as a necessity.


2.2 David Design 


David exports a fake storage stack including a fake de- 
vice of a much higher capacity than available. For the 
rest of the paper, we use the terms target to denote the 
hypothetical larger storage device, and available to de- 
note the physically available system on which David is 
running, as shown in Figure 2. It also shows a schematic 
of how David makes use of metadata remapping and data 
squashing to free up a large percentage of the required 
storage space; a much smaller backing store can now ser- 
vice the requests of the benchmark. 

David is implemented as a pseudo-device driver that
is situated below the file system and above the backing
store, interposing on all I/O requests. Since the driver 
appears as a regular device, a file system can be created 
and mounted on it. Being a loadable module, David can 
be used without any change to the application, file system 
or the kernel. Figure 3 presents the architecture of David 
with all the significant components and also shows the 
different types of requests that are handled within. We 
now describe the components of David. 

First, the Block Classifier is responsible for classify- 
ing blocks addressed in a request as data or metadata 
and preventing I/O requests to data blocks from going 
to the backing store. David intercepts all writes to data 
blocks, records the block address if necessary, and dis- 
cards the actual write using the Data Squasher. I/O re- 
quests to metadata blocks are passed on to the Metadata 
Remapper. 

Second, the Metadata Remapper is responsible for lay- 
ing out metadata blocks more efficiently on the backing 
store. It intercepts all write requests to metadata blocks, 
generates a remapping for the set of blocks addressed, 
and writes out the metadata blocks to the remapped loca- 
tions. The remapping is stored in the Metadata Remap- 
per to service subsequent reads. 

Third, writes to data blocks are not saved, but reads to 
these blocks could still be issued by the application; in 
order to allow applications to run transparently, the Data 
Generator is responsible for generating synthetic content 
to service subsequent reads to data blocks that were writ- 
ten earlier and discarded. The Data Generator contains a 
number of built-in schemes to generate different kinds of 
content and also allows the application to provide hints 
to generate more tailored content (e.g., binary files). 

Finally, by performing the above-mentioned tasks,
David modifies the original I/O request stream. These
modifications in the I/O traffic substantially change the
application runtime, rendering the measured runtime useless
for benchmarking. The Storage Model carefully models the (potentially
different) target storage subsystem underneath to predict 
the benchmark runtime on the target system. By doing 
so in an online fashion with little overhead, the Storage 
Model makes it feasible to run large workloads in a space 
and time-efficient manner. The individual components 
are discussed in detail in §3 through §6. 
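
The interplay of the components described above can be sketched as a small user-space model. This is an illustration under stated assumptions rather than David's driver: the class name, the `is_metadata` callback, and the dictionary standing in for the backing store are hypothetical, and the Storage Model is omitted.

```python
import os

BLOCK_SIZE = 4096  # assumed block size for the sketch

class DavidSketch:
    def __init__(self, is_metadata):
        self.is_metadata = is_metadata  # hypothetical Block Classifier: block -> bool
        self.remap = {}                 # Metadata Remapper: target block -> available block
        self.backing = {}               # stand-in for the (small) backing store
        self.squashed = set()           # data blocks that were written and discarded

    def write(self, block, payload):
        if self.is_metadata(block):
            # Remap metadata densely onto the available device and persist it.
            slot = self.remap.setdefault(block, len(self.remap))
            self.backing[slot] = payload
        else:
            # Data Squasher: remember the address, drop the contents.
            self.squashed.add(block)

    def read(self, block):
        if block in self.remap:
            return self.backing[self.remap[block]]
        # Data Generator: synthesize content for discarded data blocks.
        return os.urandom(BLOCK_SIZE)
```

For example, with a classifier that treats blocks below 10 as metadata, a write to block 1,000,000 consumes no backing-store space, while a write to block 3 lands in the first remapped slot.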


2.3 Choice of Available Backing Store 


David is largely agnostic to the choice of the backing 
store for available storage: HDDs, SSDs, or memory can 
be used depending on the performance and capacity re- 
quirements of the target device being emulated. Through 
a significant reduction in the number of device I/Os, 
David compensates for its internal book-keeping overhead
and also for small mismatches between the emulated
and available device. However, if one wishes to
emulate a device much faster than the available device,
using memory is a safer option. For example, as shown
in §6.3, David successfully emulates a RAID-1 configuration
using a limited amount of memory. If the performance
mismatch is not significant, a hard disk as backing
store provides much greater scale in terms of storage
capacity. Throughout the paper, “available storage” refers
to the backing store in a generic sense.

Figure 3: David Architecture. Shows the components of David and the flow of requests handled within.


3 Block Classification 


The primary requirement for David to prevent data writes 
using the Data Squasher is the ability to classify a block 
as metadata or data. David provides both implicit and ex- 
plicit block classification. The implicit approach is more 
laborious but provides a flexible approach to run unmod- 
ified applications and file systems. The explicit notifica- 
tion approach is straightforward and much simpler to im- 
plement, albeit at the cost of a small modification in the 
operating system or the application; both are available in 
David and can be chosen according to the requirements 
of the evaluator. The implicit approach is demonstrated 
using ext3 and the explicit approach using btrfs. 


3.1 Implicit Type Detection 

For ext2 and ext3, the majority of the blocks are stati- 
cally assigned for a given file system size and configu- 
ration at the time of file system creation; the allocation 
for these blocks doesn’t change during the lifetime of the 
file system. Blocks that fall in this category include the 
super block, group descriptors, inode and data bitmaps, 
inode blocks and blocks belonging to the journal; these 
blocks are relatively straightforward to classify based on 
their on-disk location, or their Logical Block Address 
(LBA). However, not all blocks are statically assigned; 
dynamically-allocated blocks include directory, indirect 
(single, double, or triple indirect) and data blocks. Unless
all blocks contain some self-identification information,
in order to accurately classify a dynamically allocated
block, the system needs to track the inode pointing
to the particular block to infer its current status.

Implicit classification is based on prior work on 
Semantically-Smart Disk Systems (SDS) [21]; an SDS 
employs three techniques to classify blocks: direct and 
indirect classification, and association. With direct clas- 
sification, blocks are identified simply by their location 
on disk. With indirect classification, blocks are identified 
only with additional information; for example, to iden- 
tify directory data or indirect blocks, the corresponding 
inode must also be examined. Finally, with association, 
a data block and its inode are connected. 

There are two significant additional challenges David 
must address. First, as opposed to SDS, David has 
to ensure that no metadata blocks are ever misclassi- 
fied. Second, benchmark scalability introduces addi- 
tional memory pressure to handle delayed classification. 
In this paper we only discuss our new contributions (the 
original SDS paper provides details of the basic block- 
classification mechanisms). 


3.1.1 Unclassified Block Store 

To infer when a file or directory is allocated and deallo- 
cated, David tracks writes to inode blocks, inode bitmaps 
and data bitmaps; to enumerate the indirect and directory 
blocks that belong to a particular file or directory, it uses 
the contents of the inode. It is often the case that the 
blocks pointed to by an inode are written out before the 
corresponding inode block; if a classification attempt is 
made when a block is being written, an indirect or di- 
rectory block will be misclassified as an ordinary data 
block. This transient error is unacceptable for David 
since it leads to the “metadata” block being discarded 
prematurely and could cause irreparable damage to the 
file system. For example, if a directory or indirect block
is accidentally discarded, it could lead to file system
corruption.

To rectify this problem, David temporarily buffers in
memory writes to all blocks which are as yet unclassi- 
fied, inside the Unclassified Block Store (UBS). These 
write requests remain in the UBS until a classification is 
made possible upon the write of the corresponding inode. 
When a corresponding inode does get written, blocks that 
are classified as metadata are passed on to the Metadata 
Remapper for remapping; they are then written out to 
persistent storage at the remapped location. Blocks clas- 
sified as data are discarded at that time. All entries in the 
UBS corresponding to that inode are also removed. 
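
The buffering and delayed classification just described can be sketched as follows. This is a simplified illustration, not the kernel implementation: the class and method names are hypothetical, and the lists of metadata and data block numbers are assumed to have been extracted from the inode by a separate parsing step.

```python
class UnclassifiedBlockStore:
    def __init__(self, remap_and_persist):
        self.pending = {}                            # block number -> buffered payload
        self.remap_and_persist = remap_and_persist   # hypothetical Metadata Remapper hook

    def buffer_write(self, block, payload):
        """Hold a write whose type cannot be decided yet."""
        self.pending[block] = payload

    def on_inode_write(self, metadata_blocks, data_blocks):
        """Classify buffered blocks named by a newly written inode.

        Returns the number of buffered data blocks that were squashed.
        """
        for block in metadata_blocks:
            if block in self.pending:
                # Directory or indirect block: persist at its remapped location.
                self.remap_and_persist(block, self.pending.pop(block))
        squashed = 0
        for block in data_blocks:
            if self.pending.pop(block, None) is not None:
                squashed += 1                        # ordinary data: discard
        return squashed
```

Blocks that belong to an inode not yet written simply stay in `pending`, which is the source of the memory pressure discussed next.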

The UBS is implemented as a list of block I/O (bio) re- 
quest structures. An extra reference to the memory pages 
pointed to by these bio structures is held by David as long
as they remain in the UBS; this reference ensures that these
pages are not mistakenly freed until the UBS is able to 
classify and persist them on disk, if needed. In order 
to reduce the inode parsing overhead otherwise imposed 
for each inode write, David maintains a list of recently 
written inode blocks that need to be processed and uses 
a separate kernel thread for parsing. 


3.1.2 Journal Snooping 


Storing unclassified blocks in the UBS can cause a strain 
on available memory in certain situations. In particular, 
when ext3 is mounted on top of David in ordered jour- 
naling mode, all the data blocks are written to disk at 
journal-commit time but the metadata blocks are written 
to disk only at the checkpoint time which occurs much 
less frequently. This results in a temporary yet precarious
buildup of data blocks in the UBS even though they
are bound to be squashed as soon as the corresponding 
inode is written; this situation is especially true when 
large files (e.g., 10s of GB) are written. In order to en- 
sure the overall scalability of David, handling large files 
and the consequent explosion in memory consumption is 
critical. To achieve this without any modification to the 
ext3 file system, David performs Journal Snooping in the
block device driver. 

David snoops on the journal commit traffic for inodes 
and indirect blocks logged within a committed transac- 
tion; this enables block classification even prior to check- 
point. When a journal-descriptor block is written as part 
of a transaction, David records the blocks that are being 
logged within that particular transaction. In addition, all 
journal writes within that transaction are cached in mem- 
ory until the transaction is committed. After that, the in- 
odes and their corresponding direct and indirect blocks 
are processed to allow block classification; the identified 
data blocks are squashed from the UBS and the iden- 
tified metadata blocks are remapped and stored persis- 
tently. The challenge in implementing Journal Snooping 
was to handle the continuous stream of unordered journal 
blocks and reconstruct the journal transaction.

Figure 4: Memory usage with Journal Snooping.


Figure 4 compares the memory pressure with and
without Journal Snooping, demonstrating its effectiveness.
It shows the number of 4 KB block I/O requests
resident in the UBS sampled at 10 sec intervals during 
the creation of a 24 GB file on ext3; the file system is 
mounted on top of David in ordered journaling mode 
with a commit interval of 5 secs. This experiment was 
run on a dual core machine with 2 GB memory. Since 
this workload is data write intensive, without Journal 
Snooping, the system runs out of memory when around 
450,000 bio requests are in the UBS (occupying roughly 
1.8 GB of memory). Journal Snooping ensures that the 
memory consumed by outstanding bio requests does not 
go beyond a maximum of 240 MB. 
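
The figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes that each bio request pins exactly one 4 KB page and ignores per-request bookkeeping; the helper names are hypothetical.

```python
BLOCK_BYTES = 4 * 1024  # 4 KB per bio request, as in the experiment

def ubs_footprint_bytes(num_bios, block_bytes=BLOCK_BYTES):
    """Memory pinned by the payload pages of num_bios buffered requests."""
    return num_bios * block_bytes

def bios_within(budget_bytes, block_bytes=BLOCK_BYTES):
    """How many buffered requests fit in a given memory budget."""
    return budget_bytes // block_bytes

print(ubs_footprint_bytes(450_000) / 1e9)   # about 1.84 (decimal GB)
print(bios_within(240 * 1024 ** 2))         # 61440
```

Under these assumptions, 450,000 buffered requests nearly exhaust the 2 GB of memory, whereas the 240 MB ceiling with Journal Snooping corresponds to roughly 60,000 outstanding requests.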


3.2 Explicit Metadata Notification 


David is meant to be useful for a wide variety of file sys- 
tems; explicit metadata notification provides a mecha- 
nism to rapidly adopt a file system for use with David. 
Since data writes can come only from the benchmark ap- 
plication in user-space whereas metadata writes are is- 
sued by the file system, our approach is to identify the 
data blocks before they are even written to the file sys- 
tem. Our implementation of explicit notification is thus 
file-system agnostic — it relies on a small modification 
to the page cache to collect additional information. We 
demonstrate the benefits of this approach using btrfs, a 
file system quite unlike ext3 in design. 

When an application writes to a file, David captures 
the pointers to the in-memory pages where the data con- 
tent is stored, as it is being copied into the page cache. 
Subsequently, when the writes reach David, they are 
compared against the captured pointer addresses to de- 
cide whether the write is to metadata or data. Once the 
presence is tested, the pointer is removed from the list 
since the same page can be reused for metadata writes in 
the future. 
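
A minimal sketch of this test-and-remove scheme is shown below. It models page pointers as opaque integers and uses hypothetical names; the real mechanism hooks into the page cache and the block layer.

```python
class ExplicitNotifier:
    def __init__(self):
        self.data_pages = set()   # identities of pages that hold application data

    def on_page_cache_copy(self, page_id):
        """Called when application data is copied into a page-cache page."""
        self.data_pages.add(page_id)

    def classify_write(self, page_id):
        """Called when a write for this page reaches the pseudo-device."""
        if page_id in self.data_pages:
            # Forget the page: it may be reused for metadata later.
            self.data_pages.discard(page_id)
            return "data"
        return "metadata"
```

Removing the entry on first use is what prevents a recycled page from being misclassified as data, at the cost of requiring every data write to be re-registered.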

There are certainly other ways to implement explicit 
notification. One way is to capture the checksum of the 
contents of the in-memory pages instead of the pointer 
to track data blocks. One can also modify the file system
to explicitly flag the metadata blocks, instead of identifying
data blocks with the page cache modification. We
believe our approach is easier to implement, does not require
any file system modification, and is also easier to
extend to software RAID since parity blocks are automatically
classified as metadata and not discarded.

FAST '11: 9th USENIX Conference on File and Storage Technologies


4 Metadata Remapping 


Since David exports a target pseudo device of much 
higher capacity to the file system than the available stor- 
age device, the bio requests issued to the pseudo device 
will have addresses in the full target range and thus need 
to be suitably remapped. For this purpose, David main- 
tains a remap table called Metadata Remapper which 
maps “target” addresses to “available” addresses. The 
Metadata Remapper can contain an entry either for one 
metadata block (e.g., super block), or a range of metadata 
blocks (e.g., group descriptors); by allowing an arbitrary 
range of blocks to be remapped together, the Metadata 
Remapper provides an efficient translation service that 
also provides scalability. Range remapping in addition 
preserves sequentiality of the blocks if a disk is used as 
the backing store. In addition to the Metadata Remapper, 
a remap bitmap is maintained to keep track of free and 
used blocks on the available physical device; the remap 
bitmap supports allocation both of a single remapped 
block and a range of remapped blocks. 

The destination (or remapped) location for a request 
is determined using a simple algorithm which takes as 
input the number of contiguous blocks that need to be 
remapped and finds the first available chunk of space 
from the remap bitmap. This can be done statically or at 
runtime; for the ext3 file system, since most of the blocks 
are statically allocated, the remapping for these blocks 
can also be done statically to improve performance. Sub- 
sequent writes to other metadata blocks are remapped dy- 
namically; when metadata blocks are deallocated, corre- 
sponding entries from the Metadata Remapper and the 
remap bitmap are removed. From our experience, this 
simple algorithm lays out blocks on disk quite efficiently. 
More sophisticated allocation algorithms based on local- 
ity of reference can be implemented in the future. 
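
The first-fit allocation over the remap bitmap can be sketched as follows. This is an illustrative simplification with hypothetical names: it keys the table by the starting target address only, so looking up a block in the middle of a remapped range is not handled.

```python
class MetadataRemapper:
    def __init__(self, available_blocks):
        self.used = [False] * available_blocks   # remap bitmap of the available device
        self.table = {}                          # target start -> (available start, length)

    def _first_fit(self, length):
        """Return the start of the first free run of `length` blocks."""
        run = 0
        for i, taken in enumerate(self.used):
            run = 0 if taken else run + 1
            if run == length:
                return i - length + 1
        raise MemoryError("backing store full")

    def remap(self, target_start, length=1):
        """Map a target block (or contiguous range) to available blocks."""
        if target_start not in self.table:
            start = self._first_fit(length)
            self.used[start:start + length] = [True] * length
            self.table[target_start] = (start, length)
        return self.table[target_start][0]

    def free(self, target_start):
        """Drop the mapping when the metadata blocks are deallocated."""
        start, length = self.table.pop(target_start)
        self.used[start:start + length] = [False] * length
```

Allocating a whole range in one step keeps the remapped blocks contiguous, which is what preserves sequentiality when a disk is used as the backing store.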


5 Data Generator 


David services the requirements of systems oblivious to 
file content with data squashing and metadata remapping. 
However, many real applications care about file content; 
the Data Generator with David is responsible for gener- 
ating synthetic content to service read requests to data 
blocks that were previously discarded. Different systems 
can have different requirements for the file content and 
the Data Generator has various options to choose from;
Figure 5 shows some examples of the different types of
content that can be generated. 

Many systems that read back previously written data 
do not care about the specific content within the files as 
long as there is some content (e.g., a file-system backup 
utility, or the Postmark benchmark). Much in the same 
way as failure-oblivious computing generates values to 
service reads to invalid memory while ignoring invalid 
writes [18], David randomly generates content to service 
out-of-bound read requests. 

Some systems may expect file contents to have valid 
syntax or semantics; the performance of these systems
depends on the actual content being read (e.g., a desktop
search engine for a file system, or a spell-checker).
For such systems, naive content generation would either 
crash the application or give poor benchmarking results. 
David produces valid file content leveraging prior work 
on generating file-system images [2]. 

Finally, some systems may expect to read back data 
exactly as they wrote earlier (i.e., a read-after-write or 
RAW dependency) or expect a precise structure that can- 
not be generated arbitrarily (e.g., a binary file or a con- 
figuration file). David provides additional support to run 
these demanding applications using the RAW Store, de- 
signed as a cooperative resource visible to the user and 
configurable to suit the needs of different applications. 

Our current implementation of RAW Store is very sim- 
ple: in order to decide which data blocks need to be 
stored persistently, David requires the application to sup- 
ply a list of the relevant file paths. David then looks up 
the inode number of the files and tracks all data blocks 
pointed to by these inodes, writing them out to disk us- 
ing the Metadata Remapper just as any metadata block. 
In the future, we intend to support more nuanced ways to 
maintain the RAW Store; for example, specifying direc- 
tories instead of files, or by using Memoization [14]. 

For applications that must exactly read back a signif- 
icant fraction of what they write, the scalability advan- 
tage of David diminishes; in such cases the benefits are 
primarily from the ability to emulate new devices. 


6 Storage Model and Emulation 


Not having access to the target storage system requires 
David to precisely capture the behavior of the entire stor- 
age stack with all its dependencies through a model. The 
storage system modeled by David is the target system 
and the system on which it runs is the available system. 
David emulates the behavior of the target disk by sending
requests to the available disk (for persistence) while
simultaneously sending the target request stream to the
Storage Model; the model computes the time the request
would have taken to finish on the target system
and introduces an appropriate delay in the actual request
stream before returning control. Figure 3 presented in §2
shows this setup more clearly.

[Figure 5 panels: compress.pdf, config.RAW]

Figure 5: Examples of content generation by Data Generator. The figure shows a randomly generated text file, a text file
with semantically meaningful content, a well-formatted PDF file, and a config file with precise syntax to be stored in the RAW Store.

Parameter                H1                       H2
Disk size                80 GB                    1 TB
Rotational speed         7200 RPM                 7200 RPM
Number of cylinders      88283                    147583
Number of zones          30                       30
Sectors per track        567 to 1170              840 to 1680
Cylinders per zone       1444 to 1521             1279 to 8320
On-disk cache size       2870 KB                  300 MB
Disk cache segment       260 KB                   600 KB
Req scheduling*          FIFO                     FIFO
Cache segments           [illegible]              500
Cache R/W partition      [illegible]              Varies
Bus transfer             133 MBps                 133 MBps
Seek profile (long)      3800+(cyl*116)/10^[?]    3300+(cyl*5)/10^[?]
Seek profile (short)     300+√(cyl*2235)          700+√cyl
Head switch              1.4 ms                   1.4 ms
Cylinder switch          1.6 ms                   1.6 ms
Dev driver req queue*    128-160                  128-160
Req queue timeout*       3 ms (unplug)            3 ms (unplug)

Table 1: Storage Model Parameters in David. Lists important parameters obtained to model disks Hitachi
HDS728080PLA380 (H1) and Hitachi HDS721010KLA330 (H2). * denotes parameters of I/O request queue (IORQ).

As a general design principle, to support low-overhead
modeling without compromising accuracy, we avoid us- 
ing any technique that either relies on storing empiri- 
cal data to compute statistics or requires table-based ap- 
proaches to predict performance [6]; the overheads for 
such methods are directly proportional to the amount 
of runtime statistics being maintained which in turn de- 
pends on the size of the disk. Instead, wherever applica- 
ble, we have adopted and developed analytical approxi- 
mations that did not slow the system down; our resulting 
models are sufficiently lean while being fairly accurate. 

To ensure portability of our models, we have refrained 
from making device-specific optimizations to improve 
accuracy; we believe current models in David are fairly 
accurate. The models are also adaptive enough to be eas- 
ily configured for changes in disk drives and other pa- 
rameters of the storage stack. We next present some de- 
tails of the disk model and the storage stack model. 


6.1 Disk Model 


David’s disk model is based on the classical model proposed
by Ruemmler and Wilkes [19], henceforth referred to
as the RW model. The disk model contains information
about the disk geometry (i.e., platters, cylinders and
zones) and maintains the current disk head position; us- 
ing these sources it models the disk seek, rotation, and 
transfer times for any request. The disk model also keeps 
track of the effects of disk caches (track prefetching, 
write-through and write-back caches). In the future, it
will be interesting to explore using Disksim for the disk
model. Disksim is a detailed user-space disk simulator 
which allows for greater flexibility in the types of device 
properties that can be simulated along with their degree 
of detail; we will need to ensure it does not appreciably 
slow down the emulation when used without memory as 
backing store. 


6.1.1 Disk Drive Profile 

The disk model requires a number of drive-specific pa- 
rameters as input, a list of which is presented in the first 
column of Table 1; currently David contains models for 
two disks: the Hitachi HDS728080PLA380 80 GB disk, 
and the Hitachi HDS721010KLA330 1 TB disk. We 
have verified the parameter values for both these disks 
through carefully controlled experiments. David is en- 
visioned for use in environments where the target drive 
itself may not be available; if users need to model addi- 
tional drives, they need to supply the relevant parameters. 
Disk seeks, rotation time and transfer times are modeled 
much in the same way as proposed in the RW model. The 
actual parameter values defining the above properties are 
specific to a drive; empirically obtained values for the 
two disks we model are shown in Table 1. 
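
As a rough illustration of an RW-style service-time computation, the sketch below adds seek, rotational, and transfer components. All coefficients and the short/long seek threshold are hypothetical placeholders, not the values from Table 1, and the rotational delay is approximated by a fixed fraction of a revolution rather than derived from a tracked head position as David's model does.

```python
import math

def service_time_ms(cyl_distance, sectors, *, rpm=7200, sectors_per_track=1000,
                    short_seek=(0.3, 0.05), long_seek=(3.8, 0.0001),
                    short_limit=1000, rotation_fraction=0.5):
    """Seek + rotational latency + media transfer time, in milliseconds."""
    if cyl_distance == 0:
        seek = 0.0
    elif cyl_distance < short_limit:
        a, b = short_seek
        seek = a + b * math.sqrt(cyl_distance)   # short seeks grow with sqrt(distance)
    else:
        a, b = long_seek
        seek = a + b * cyl_distance              # long seeks grow linearly
    rev_ms = 60_000.0 / rpm                      # one full revolution
    rotation = rotation_fraction * rev_ms        # stand-in for head-position tracking
    transfer = rev_ms * sectors / sectors_per_track
    return seek + rotation + transfer
```

A purely analytical form like this needs only a handful of per-drive constants, which is why supplying a new drive profile amounts to supplying new parameter values.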


6.1.2 Disk Cache Modeling 

The drive cache is usually small (a few hundred KB to a
few MB) and serves to cache reads from the disk media
to service future reads, or to buffer writes. Unfortunately,
the drive cache is also one of the least specified
components; the cache management logic is low-level
firmware code which is not easy to model.

David models the number and size of segments in the
disk cache and the number of disk sector-sized slots in 
each segment. Partitioning of the cache segments into 
read and write caches, if any, is also part of the informa- 
tion contained in the disk model. David models the read 
cache with a FIFO eviction policy. To model the effects 
of write caching, the disk model maintains statistics on 
the current size of writes pending in the disk cache and 
the time needed to flush these writes out to the media. 
Write buffering is simulated by periodically emptying a 
fraction of the contents of the write cache during idle 
periods of the disk in between successive foreground re- 
quests. The cache is modeled with a write-through policy 
and is partitioned into a sufficiently large read cache to 
match the read-ahead behavior of the disk drive. 
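
The segmented read cache with FIFO eviction can be sketched as below. This is a simplified model with hypothetical names: each miss loads one segment-sized run starting at the requested sector (a crude stand-in for read-ahead), and write caching is not modeled.

```python
from collections import deque

class FifoSegmentCache:
    def __init__(self, num_segments, sectors_per_segment):
        self.sectors_per_segment = sectors_per_segment
        # Each entry is the first sector held by a segment; when the deque is
        # full, appending a new segment drops the oldest one (FIFO).
        self.segments = deque(maxlen=num_segments)

    def read(self, sector):
        """Return True on a cache hit, False on a miss (which fills a segment)."""
        for start in self.segments:
            if start <= sector < start + self.sectors_per_segment:
                return True          # FIFO: a hit does not change eviction order
        self.segments.append(sector)
        return False
```

With two segments of eight sectors each, reading sector 0 and then sector 5 yields a miss followed by a hit; two further misses elsewhere evict the first segment, so sector 5 misses again.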


6.2 Storage Stack Model

David also models the I/O request queues (IORQs) maintained
in the OS; Table 1 lists a few of its important
parameters. While developing the Storage Model,
we found that accurately modeling the behavior of the 
IORQs is crucial to predict the target execution time cor- 
rectly. The IORQs usually have a limit on the maximum 
number of requests that can be held at any point; pro- 
cesses that try to issue an I/O request when the IORQ is 
full are made to wait. Such waiting processes are wo- 
ken up when an I/O issued to the disk drive completes, 
thereby creating an empty slot in the IORQ. Once wo- 
ken up, the process is also granted privilege to batch a 
constant number of additional I/O requests even when 
the IORQ is full, as long as the total number of requests 
is within a specified upper limit. Therefore, for applica- 
tions issuing bursty I/O, the time spent by a request in the 
IORQ can outweigh the time spent at the disk by several 
orders of magnitude; modeling the IORQs is thus crucial 
for overall accuracy. 

Disk requests arriving at David are first enqueued into 
a replica queue maintained inside the Storage Model. 
While being enqueued, the disk request is also checked 
for a possible merge with other pending requests: a com- 
mon optimization that reduces the number of total re- 
quests issued to the device. There is a limit on the num- 
ber of disk requests that can be merged into a single disk 
request; eventually merged disk requests are dequeued 
from the replica queue and dispatched to the disk model 
to obtain the service time spent at the drive. The replica 
queue uses the same request scheduling policy as the tar- 
get IORQ. 
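
The enqueue-with-merge behavior of the replica queue can be sketched as follows. The sketch assumes FIFO scheduling and back merges only, and the `max_merged` limit and all names are hypothetical; the actual limits and scheduling policy follow the target IORQ.

```python
class ReplicaQueue:
    def __init__(self, max_merged=4):
        self.max_merged = max_merged   # hypothetical cap on requests merged into one
        self.queue = []                # pending requests in FIFO order

    def enqueue(self, start, length):
        """Add a request, merging it into a contiguous pending one if allowed."""
        for req in self.queue:
            contiguous = req["start"] + req["length"] == start
            if contiguous and req["merged"] < self.max_merged:
                req["length"] += length
                req["merged"] += 1
                return
        self.queue.append({"start": start, "length": length, "merged": 1})

    def dispatch(self):
        """Hand the oldest (possibly merged) request to the disk model."""
        if not self.queue:
            return None
        req = self.queue.pop(0)
        return req["start"], req["length"]
```

Merging matters for accuracy because the disk model is charged once per dispatched request; without it, a sequential burst would be modeled as many small requests instead of a few large ones.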


6.3 RAID Emulation


David can also provide effective RAID emulation. To 
demonstrate simple RAID configurations with David, 
each component disk is emulated using a memory-backed
“compressed” device underneath software RAID.
David exports multiple block devices with separate major
and minor numbers; it differentiates requests to different
devices using the major number. For the purpose
of performance benchmarking, David uses a single
memory-based backing store for all the compressed
RAID devices. Using multiple threads, the Storage 
Model maintains separate state for each of the devices 
being emulated. Requests are placed in a single request 
queue tagged with a device identifier; individual Storage 
Model threads for each device fetch one request at a time 
from this request queue based on the device identifier. 
Similar to the single device case, the servicing thread cal- 
culates the time at which a request to the device should 
finish and notifies completion using a callback. 

David currently only provides mechanisms for simple 
software RAID emulation that do not need a model of a 
software RAID itself. New techniques might be needed 
to emulate more complex commercial RAID configura- 
tions, for example, commercial RAID settings using a 
hardware RAID card. 


7 Evaluation 


We seek to answer four important questions. First, what 
is the accuracy of the Storage Model? Second, how ac- 
curately does David predict benchmark runtime and what 
storage space savings does it provide? Third, can David 
scale to large target devices including RAID? Finally, 
what is the memory and CPU overhead of David? 


7.1 Experimental Platform 

We have developed David for the Linux operating system.
The hard disks currently modeled are the 1 TB Hitachi
HDS721010KLA330 (referred to as D_1TB) and the 80 GB
Hitachi HDS728080PLA380 (referred to as D_80GB); Table 1
lists their relevant parameters. Unless specified otherwise,
the following hold for all the experiments: (1) the machine
used has a quad-core Intel processor and 4 GB RAM and runs
Linux 2.6.23.1; (2) the ext3 file system is mounted in
ordered-journaling mode with a commit interval of 5 sec;
(3) microbenchmarks were run directly on the disk without a
file system; (4) David predicts the benchmark runtime for a
target D_1TB while in fact running on the available D_80GB;
(5) to validate accuracy, David was instead run directly on
D_1TB.


7.2 Storage Model Accuracy 

First, we validate the accuracy of the Storage Model in
predicting the benchmark runtime on the target system.
Since our aim is to validate the accuracy of the Storage
Model alone, we run David in a model-only mode where we
disable block classification, remapping and data squashing.
David just passes down the requests that it receives to the
available request queue below. We run

Figure 6: Storage Model accuracy for Sequential and Random Reads and Writes. The graphs show the cumulative
distribution of measured and modeled times for sequential and random reads and writes.

Figure 7: Storage Model accuracy. The graphs show the cumulative distribution of measured and modeled times for the
following workloads from left to right: Postmark, Webserver, Varmail and Tar.

                                            Implicit Classification (Ext3)  Explicit Notification (Btrfs)
Benchmark   Original    David     Storage   Original  David     Runtime     Original  David     Runtime
Workload    Storage     Storage   Savings   Runtime   Runtime   Error       Runtime   Runtime   Error
            (KB)        (KB)      (%)       (Secs)    (Secs)    (%)         (Secs)    (Secs)    (%)
mkfs        976762584   7900712   99.19     278.66    281.81    1.13        -         -         -
imp         11224140    18368     99.84     344.18    339.42    -1.38       327.294   324.057   0.99
tar         21144       628       97.03     257.66    255.33    -0.9        146.472   135.014   7.8
grep        -           -         -         250.52    254.40    1.55        141.960   138.455   2.47
virus scan  -           -         -         55.60     47.95     -13.75      27.420    31.555    15.08
find        -           -         -         26.21     26.60     1.5         -         -         -
du          -           -         -         102.69    101.36    -1.29       -         -         -
postmark    204572      404       99.80     33.23     29.34     -11.69      22.709    22.243    2.05
webserver   3854828     3920      99.89     127.04    126.94    -0.08       125.611   126.504   0.71
varmail     7852        3920      50.07     126.66    126.27    -0.31       126.019   126.478   0.36
sr          -           -         -         40.32     44.90     11.34       40.32     44.90     11.34
rr          -           -         -         913.10    935.46    2.45        913.10    935.46    2.45
sw          -           -         -         57.28     58.96     2.93        57.28     58.96     2.93
rw          -           -         -         308.74    291.40    -5.62       308.74    291.40    -5.62

Table 2: David Performance and Accuracy. Shows savings in capacity, accuracy of runtime prediction, and the overhead
of storage modeling for different workloads. Webserver and varmail are generated using FileBench; virus scan using AVG.


David on top of D_1TB and set the target drive to be the
same. Note that the available system is the same as the 
target system for these experiments since we only want 
to compare the measured and modeled times to validate 
the accuracy of the Storage Model. Each block request 
is traced along its path from David to the disk drive and 
back. This is done in order to measure the total time that 
the request spends in the available IORQ and the time 
spent getting serviced at the available disk. These mea- 
sured times are then compared with the modeled times 
obtained from the Storage Model. 


Figure 6 shows the Storage Model accuracy for four
micro-workloads: sequential and random reads, and
sequential and random writes; these micro-workloads have
demerit figures of 24.39, 5.51, 0.08, and 0.02 respectively,
as computed using the Ruemmler and Wilkes methodology [19].
The large demerit for sequential reads is due to a variance
in the available disk’s cache-read times; modeling the disk
cache in greater detail in the future could potentially
avoid this situation. However, sequential read requests do
not contribute to a measurably large error in the total
modeled runtime; they often hit the disk cache and have
service times less than 500 microseconds while other types
of disk requests take around 20 to 35 milliseconds to get
serviced. Any inaccuracy in the modeled times for sequential
reads is negligible when compared to the service times of
other types of disk requests; we thus chose to not make the
disk-cache model more complex for the sake of sequential
reads.
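
For readers unfamiliar with the metric, the sketch below shows one way to compute a demerit figure, under our understanding that Ruemmler and Wilkes define it as the root mean square of the horizontal distance between the measured and modeled time-distribution curves. The function names are our own, the equal-sample-count requirement is a simplification, and the text does not say whether the figures quoted above are absolute or relative to the mean measured time, so both forms are shown.

```python
import math


def demerit(measured, modeled):
    """RMS horizontal distance between two empirical CDFs.

    Assumes both samples have the same size, so the i-th smallest values
    sit at the same cumulative fraction on the two curves.
    """
    if len(measured) != len(modeled) or not measured:
        raise ValueError("need two non-empty samples of equal size")
    a, b = sorted(measured), sorted(modeled)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def relative_demerit(measured, modeled):
    """Demerit expressed as a percentage of the mean measured time."""
    mean_measured = sum(measured) / len(measured)
    return 100.0 * demerit(measured, modeled) / mean_measured
```

A model whose sorted request times match the measured ones exactly has a demerit of zero; a constant offset of one time unit on every request gives a demerit of one.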

Figure 7 shows the accuracy for four different macro
workloads and application kernels: Postmark [13], Webserver
(generated using FileBench [15]), Varmail (mail
server workload using FileBench), and a Tar workload
(copy and untar of the Linux kernel of size 46 MB).

The FileBench Varmail workload emulates an NFS
mail server, similar to Postmark, but is multi-threaded.
The Varmail workload consists of a set of
open/read/close, open/append/close and delete operations
in a single directory, issued in a multi-threaded fashion.
The FileBench webserver workload comprises a mix of
open/read/close operations on multiple files in a directory
tree. In addition, to simulate a webserver log, a file
append operation is also issued. The workload consists of
100 threads issuing 16 KB appends to the weblog every 10
reads.

Overall, we find that storage modeling inside David is 
quite accurate for all workloads used in our evaluation. 
The total modeled time as well as the distribution of the 
individual request times are close to the total measured 
time and the distribution of the measured request times. 


7.3 David Accuracy

Next, we want to measure how accurately David predicts 
the benchmark runtime. Table 2 lists the accuracy and 
storage space savings provided by David for a variety of 
benchmark applications for both ext3 and btrfs. We have 
chosen a set of benchmarks that are commonly used and 
also stress various paths that disk requests take within 
David. The first and second columns of the table show 
the storage space consumed by the benchmark workload 
without and with David. The third column shows the 
percentage savings in storage space achieved by using 
David. The fourth column shows the original benchmark
runtime without David on D_1TB. The fifth column
shows the benchmark runtime with David on D_80GB. The
sixth column shows the percentage error in the predic- 
tion of the benchmark runtime by David. The final three 
columns show the original and modeled runtime, and the 
percentage error for the btrfs experiments; the storage 
space savings are roughly the same as for ext3. The sr, 
rr, sw, and rw workloads are run directly on the raw de- 
vice and hence are independent of the file system. 
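
The two derived columns of Table 2 can be reproduced with simple arithmetic. The sketch below uses hypothetical helper names; the sign convention for runtime error (positive when David over-predicts) is inferred from the ext3 rows of the table rather than stated in the text.

```python
def storage_savings_pct(original_kb, david_kb):
    """Percentage of storage space saved by running under David."""
    return 100.0 * (original_kb - david_kb) / original_kb


def runtime_error_pct(original_secs, david_secs):
    """Signed percentage error of David's predicted runtime."""
    return 100.0 * (david_secs - original_secs) / original_secs
```

For the mkfs row, 976762584 KB shrinking to 7900712 KB gives a saving of about 99.19%, and a predicted 281.81 s against a measured 278.66 s gives an error of about 1.13%, matching the table.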

mkfs creates a file system with a 4 KB block size over
the 1 TB target device exported by David. This workload
only writes metadata and David remaps writes issued by
mkfs sequentially starting from the beginning of D_80GB;
no data squashing occurs in this experiment.

imp creates a realistic file-system image of size 10 GB 
using the publicly available Impressions tool [2]. A total 
of 5000 regular files and 1000 directories are created with 
an average of 10.2 files per directory. This workload is 
a data-write intensive workload and most of the issued 
writes end up being squashed by David. 

Figure 8: Storage Space Savings and Model Accuracy.
The “Space” lines show the savings in storage space
achieved when using David for the Impressions workload with
file-system images of varying sizes up to 800 GB; “Time” lines
show the accuracy of runtime prediction for the same workload.
WOD: space/time without David, D: space/time with David.


tar uses the GNU tar utility to create a gzipped archive 
of the file-system image of size 10 GB created by imp; 
it writes the newly created archive in the same file sys- 
tem. This workload is a data read and data write inten- 
sive workload. The data reads are satisfied by the Data 
Generator without accessing the available disk, while the 
data writes end up being squashed. 

grep uses the GNU grep utility to search for the ex- 
pression “nothing” in the content generated by both imp 
and tar. This workload issues significant amounts of data 
reads and small amounts of metadata reads. virus scan 
runs the AVG virus scanner on the file-system image cre- 
ated by imp. find and du run the GNU find and GNU du 
utilities over the content generated by both imp and tar. 
These two workloads are metadata read only workloads. 

David works well under both the implicit and ex- 
plicit approaches demonstrating its usefulness across file 
systems. Table 2 shows how David provides tremen- 
dous savings in the required storage capacity, upwards of 
99% (a 100-fold or more reduction) for most workloads. 
David also predicts benchmark runtime quite accurately. 
Prediction error for most workloads is less than 3%, al- 
though for a few it is just over 10%. The errors in the 
predicted runtimes stem from the relative simplicity of 
our in-kernel Disk Model; for example, it does not cap- 
ture the layout of physical blocks on the magnetic media 
accurately. This information is not published by the disk 
manufacturers and experimental inference is not possible 
for ATA disks that do not have a command similar to the 
SCSI mode page. 


7.4 David Scalability 


David is aimed at providing scalable emulation using 
commodity hardware; it is important that accuracy is 

Table 3: David Software RAID-1 Emulation (rows: Measured
and Modeled). Shows IOPS for a software RAID-1 setup using
David with memory as backing store; the workload issues
20000 read and write requests through concurrent processes,
which equal the number of disks in the experiment. 1-disk
experiments run without RAID-1.


not compromised at larger scale. Figure 8 shows the 
accuracy and storage space savings provided by David 
while creating file-system images of 100s of GB. Using 
an available capacity of only 10 GB, David can model the 
runtime of Impressions in creating a realistic file-system 
image of 800 GB; in contrast to the linear scaling of the 
target capacity demanded, David barely requires any ex- 
tra available capacity. David also predicts the benchmark 
runtime within a maximum of 2.5% error even with the 
huge disparity between target and available disks at the 
800 GB mark, as shown in Figure 8. 

We limit these experiments to a target capacity of less
than 1 TB because we had access to only a terabyte-sized
disk against which we could validate the accuracy of
David. Extrapolating from this experience, we believe
David will enable one to emulate disks of 10s or 100s of
TB given the 1 TB disk.


7.5 David for RAID 

We present a brief evaluation and validation of software 
RAID-1 configurations using David. Table 3 shows a 
simple experiment where David emulates a multi-disk 
software RAID-1 (mirrored) configuration; each device 
is emulated using a memory-disk as backing store. How- 
ever, since the multiple disks contain copies of the same 
block, a single physical copy is stored, further reducing 
the memory footprint. In each disk setup, a set of threads
equal in number to the number of disks issues a total of
20000 requests. David is able to accurately emulate the
software RAID-1 setup up to 3 disks; more complex RAID
schemes are left as part of future work.


7.6 David Overhead 


David is designed to be used for benchmarking and not 
as a production system, thus scalability and accuracy are 
the more relevant metrics of evaluation; we do however 
want to measure the memory and CPU overhead of us- 
ing David on the available system to ensure it is prac- 
tical to use. All memory usage within David is tracked 

Figure 9: David CPU and Memory Overhead. Shows
the memory and percentage CPU consumption by David while
creating a 10 GB file-system image using Impressions. WOD
CPU: CPU without David, SM CPU: CPU with Storage Model
alone, D CPU: total CPU with David, SM Mem: Storage
Model memory alone, D Mem: total memory with David.


using several counters; David provides support to mea- 
sure the memory usage of its different components using 
ioctls. To measure the CPU overhead of the Storage 
Model alone, David is run in the model-only mode where 
block classification, remapping and data squashing are 
turned off. 

In our experience with running different workloads, 
we found that the memory and CPU usage of David is 
acceptable for the purposes of benchmarking. As an ex- 
ample, Figure 9 shows the CPU and memory consump- 
tion by David captured at 5 second intervals while cre- 
ating a 10 GB file-system image using Impressions. For 
this experiment, the Storage Model consumes less than 1 
MB of memory; the average memory consumed in total 
by David is less than 90 MB, of which the pre-allocated 
cache used by the Journal Snooping to temporarily store 
the journal writes itself contributes 80 MB. The amount of
CPU used by the Storage Model alone is insignificant;
however, implicit classification by the Block Classifier
is the primary consumer of CPU, using 10% on average
with occasional spikes. The CPU overhead is not an issue
at all if one uses explicit notification.


8 Related Work 


Memulator [10] makes a great case for why storage em- 
ulation provides the unique ability to explore nonexistent 
storage components and take end-to-end measurements. 
Memulator is a “timing-accurate” storage emulator that 
allows a simulated storage component to be plugged 
into a real system running real applications. Memula- 
tor can use the memory of either a networked machine or 
the local machine as the storage media of the emulated 
disk, enabling full system evaluation of hypothetical
storage devices. Although this provides flexibility in device
emulation, high-capacity devices require an equivalent
amount of memory; David provides the necessary scalability
to emulate such devices. In turn, David can benefit
from the networked-emulation capabilities of Memula- 
tor in scenarios when either the host machine has limited 
CPU and memory resources, or when the interference of 
running David on the same machine competing for the 
same resources is unacceptable. 

One alternative to emulation is to simply buy a
larger-capacity or newer device and use it to run the
benchmarks. This is sometimes feasible, but often not
desirable. Even if one buys a larger disk, in the future
one would need an even larger one; David allows one to keep
up with this arms race without always investing in new
devices. Note that we chose 1 TB as the upper limit for
evaluation in this paper because we could validate our
results for that size. Having a large disk will also not
address the issue of emulating much faster devices such as
SSDs or RAID configurations. David emulates faster devices
through an efficient use of memory as backing store.

Another alternative is to simulate the storage component
under test; disk simulators like Disksim [7] allow such an
evaluation flexibly. However, simulation results are often
far from perfect [9]: they fail to capture system
dependencies and require the generation of representative
I/O traces, which is a challenge in itself.

Finally, one might use analytical modeling for the stor- 
age devices; while very useful in some circumstances, 
it is not without its own set of challenges and limita- 
tions [20]. In particular, it is extremely hard to capture 
the interactions and complexities in real systems. Wher- 
ever possible, David does leverage well-tested analytical 
models for individual components to aid the emulation. 

Both simulation and analytical modeling are comple- 
mentary to emulation, perfectly useful in their own right. 
Emulation does however provide a reasonable middle 
ground in terms of flexibility and realism. 

Evaluation of how well an I/O system scales has been 
of interest in prior research and is becoming increasingly
relevant [28]. Chen and Patterson proposed
a “self-scaling” benchmark that scales with the I/O sys- 
tem being evaluated, to stress the system in meaningful 
ways [8]. Although useful for disk and I/O systems, 
the self-scaling benchmarks are not directly applicable 
for file systems. The evaluation of the XFS file sys- 
tem from Silicon Graphics uses a number of benchmarks 
specifically intended to test its scalability [23]; such an 
evaluation can benefit from David to employ even larger 
benchmarks with greater ease; SpecSFS [27] also con- 
tains some techniques for scalable workload generation. 

Similar to our emulation of scale in a storage system, 
Gupta et al. from UCSD propose a technique called time 
dilation for emulating network speeds orders of magnitude
faster than available [11]. Time dilation allows
one to experiment with unmodified applications running 
on commodity operating systems by subjecting them to 
much faster network speeds than actually available. 

A key challenge in David is the ability to identify data
and metadata blocks. Besides SDS [21], XN, the stable
storage system for the Xok exokernel [12], dealt with
similar issues. XN employed a template of metadata
translation functions called UDFs specific to each file type.
The responsibility of providing UDFs rested with the file
system developer, allowing the kernel to handle arbitrary
metadata layouts without understanding the layout itself.
Specifying an encoding of the on-disk scheme can be
tricky for a file system such as ReiserFS that uses
dynamic allocation; however, in the future, David’s
metadata classification scheme can benefit from a more
formally specified on-disk layout per file system.


9 Conclusion 


David is born out of the frustration in doing large-scale 
experimentation on realistic storage hardware — a prob- 
lem many in the storage community face. David makes it 
practical to experiment with benchmarks that were oth- 
erwise infeasible to run on a given system, by transpar- 
ently scaling down the storage capacity required to run 
the workload. The available backing store under David 
can be orders of magnitude smaller than the target de- 
vice. David ensures accuracy of benchmarking results 
by using a detailed storage model to predict the runtime. 
In the future, we plan to extend David to include support 
for a number of other useful storage devices and configu- 
rations. In particular, the Storage Model can be extended 
to support flash-based SSDs using an existing simulation 
model [5]. We believe David will be a useful emulator 
for file and storage system evaluation. 


Abstract


As file systems reach the petabyte scale, users and
administrators are increasingly interested in acquiring
high-level analytical information for file management and
analysis. Two particularly important tasks are the
processing of aggregate and top-k queries which,
unfortunately, cannot be quickly answered by hierarchical
file systems such as ext3 and NTFS. Existing pre-processing
based solutions, e.g., file system crawling and index
building, consume a significant amount of time and space
(for generating and maintaining the indexes) which in
many cases cannot be justified by the infrequent usage
of such solutions. In this paper, we advocate that user
interests can often be sufficiently satisfied by
approximate, i.e., statistically accurate, answers. We
develop Glance, a just-in-time sampling-based system which,
after consuming a small number of disk accesses, is capable
of producing extremely accurate answers for a broad class
of aggregate and top-k queries over a file system without
the requirement of any prior knowledge. We use a
number of real-world file systems to demonstrate the
efficiency, accuracy and scalability of Glance.


1 Introduction 


Today a file system with billions of files, millions of
directories and petabytes of storage is no longer an
exception [29]. As file systems grow, users and
administrators are increasingly keen to perform complex
queries [37, 47], such as “How many files have been updated
since ten days ago?”, and “Which are the top five largest
files that belong to John?”. The first is an example of
aggregate queries, which provide a high-level summary
of all or part of the file system, while the second is an
example of top-k queries, which locate the k files and/or
directories that have the highest score according to a
scoring function. Fast processing of aggregate and top-k
queries is often needed by applications that require
just-in-time analytics over large file systems, such as
data management, archiving, enterprise surveillance, etc.
The just-in-time requirement is defined by two properties:
(1) file-system analytics must be completed within a short
amount of time, and (2) the analyzer holds no prior
knowledge (e.g., pre-processing results) of the file system
being analyzed. For example, in order for a librarian to
determine how to build an image archive from an external
storage medium (e.g., a Blu-ray disc), he/she may have to
first estimate the total size of picture files stored on
the external medium: the librarian needs to complete data
analytics quickly, over an alien file system that has never
been seen before.


Unfortunately, hierarchical file systems (e.g., ext3 and 
NTFS) are not well equipped for the task of just-in-time 
analytics [43]. The deficiency is in general due to the 
lack of a global view (i.e., high-level statistics) of
metadata information (e.g., size, creation, access and
modification time). For efficiency concerns, a hierarchical file
system is usually designed to limit the update of meta- 
data information to individual files and/or the immedi- 
ately preceding directories, leading to localized views. 
For example, while the last modification time of an indi- 
vidual file is easily retrievable, the last modification time 
of files that belong to user John is difficult to obtain be- 
cause such metadata information is not available at the 
global level. 


Currently, there are two approaches for generating
high-level statistics from a hierarchical file system, and
thereby answering aggregate and top-k queries. (1) Scanning
the file system upon the arrival of each query, e.g., with
the find command in Linux, which is inefficient for large
file systems. While storage capacity increases ~60% per
year, storage throughput and latency improve much more
slowly; thus the amount of time required to scan an
off-the-shelf hard drive or external storage medium has
increased significantly over time, to the point of being
infeasible for just-in-time analytics. The above-mentioned
image-archiving application is a typical example, as it is
usually impossible to completely scan an alien Blu-ray disc
within a short amount of time. (2) Utilizing pre-built
indexes which are regularly updated [3, 7, 26, 32, 36, 40].
Many desktop search products, e.g., Google Desktop
[23] and Beagle [5], belong to this category. While this
approach is capable of fast query processing once the
(slow) index building process is complete, it may not be
suitable or applicable to many just-in-time applications:


• Index building can be unrealistic for many applications
  that require just-in-time analytics over an alien
  file system. An example is enterprise surveillance
  [35], where portable machines and storage devices
  must be quickly examined before being allowed to
  join the enterprise network.

• Even if an index can be built up-front, its significant
  cost may not be justifiable if the index is not
  frequently used afterwards. Unfortunately, this is
  common for some large file systems, e.g., storage
  archives or scratch data for scientific applications
  rarely require the global search function offered by
  the index, and may only need analytical queries
  to be answered infrequently (e.g., once every few
  days). In this case, building and updating an index
  is often overkill given the high amortized cost.

• There are also other limitations of maintaining an
  index. For example, prior work [46] has shown that
  even after a file has been completely removed (from
  both the file system and the index), the (former)
  existence of this file can still be inferred from the
  index structure. Thus, a file system owner may choose
  to avoid building an index for privacy concerns.


To enable just-in-time analytics, one must be able to
perform on-the-fly processing of analytical queries
over traditional file systems that normally have
insufficient metadata to support such complex queries. We
achieve this goal by striking a balance between query
answer accuracy and cost: providing approximate (i.e.,
statistically accurate) answers which, with a high
confidence level, reside within a close distance from the
precise answer. For example, when a user wants to count
the number of files in a directory (and all of its
subdirectories), an approximate answer of 105,000 or 95,000,
compared with the real answer of 100,000, makes little
difference to the high-level knowledge desired by the
user. In general, the higher the cost a user is willing to
pay for answering a query, the more accurate the answer can
be.

To this end, we design and develop Glance, a just-in-time
query processing system which produces accurate
query answers based on a small number of samples (files
or folders) that can be collected from a very large file
system with a few disk accesses. Glance is file-system
agnostic, i.e., it can be applied instantly over any new file
system and work seamlessly with the tree structure of the
system. Glance removes the need for disk crawling and
index building, providing just-in-time analytics without
a priori knowledge or pre-processing of the file systems.
This is desirable in situations when the metadata indexes
are not available, a query is not supported by the index,
or query processing is only rarely needed.

Using sampling for processing analytical queries is by 
no means new. Studies on sampling flat files, hashed 
files, and files generated by a relational database system 
(e.g., a B+-tree file) started more than 20 years ago - see 
survey [39] - and were followed by a myriad of work on 
database sampling for approximate query processing in 
decision support systems - see tutorials [4, 15, 22]. A
wide variety of sampling techniques, e.g., simple ran- 
dom sampling [38], stratified [10], reservoir [48] and 
cluster sampling [11], have been used. Nonetheless, to 
the best of our knowledge, there has been no existing 
work on using sampling to support efficient aggregate 
and top-k query processing over a large hierarchical file 
system, i.e., one with numerous files organized in a com- 
plex folder structure (tree-like or directed acyclic graph). 

Our main contributions are two-fold: (1) Glance con- 
sists of two algorithms, FS_Agg and FS_TopK, for the ap- 
proximate processing of aggregate and top-k queries, re- 
spectively. For just-in-time analytics over very large file 
systems, we develop a random descent technique for un- 
biased aggregate estimations and a pruning-based tech- 
nique for top-k query processing. (2) We study the spe- 
cific characteristics of real-world file systems and derive 
the corresponding enhancements to our proposed techniques.
In particular, according to the distribution of files
in real-world file systems, we propose a high-level crawling
technique to significantly reduce the error of query
processing. Based on an analysis of accuracy and ef- 
ficiency for the descent process, we propose a breadth- 
first implementation to reduce both error and overhead. 
We evaluate Glance over both real-world (e.g., NTFS, 
NFS, Plan 9) and synthetic file systems and find very 
promising results - e.g., 90% accuracy at 20% cost.
Furthermore, we demonstrate that Glance is scalable to one
billion files and millions of directories.
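
To give a feel for what a random descent can look like, the sketch below estimates the total number of files in a directory tree from a single root-to-leaf walk. It is one standard way to build an unbiased estimator and is not necessarily the exact procedure used by FS_Agg, which is described later in the paper. The in-memory tree format (dictionaries with hypothetical `files` and `subdirs` keys) and the function names are assumptions made for this example.

```python
import random


def random_descent_count(root, rng=random):
    """One random descent; returns an unbiased estimate of the file count.

    Each node is a dict with "files" (files directly inside it) and
    "subdirs" (list of child nodes). Files seen at a node are weighted by
    the inverse of the probability of reaching that node.
    """
    estimate, weight, node = 0.0, 1.0, root
    while True:
        estimate += weight * node["files"]
        subdirs = node["subdirs"]
        if not subdirs:
            return estimate
        weight *= len(subdirs)      # branch chosen with prob. 1/len(subdirs)
        node = rng.choice(subdirs)  # descend into one subdirectory


def estimate_count(root, descents=1000, rng=random):
    """Average several independent descents to reduce variance."""
    total = sum(random_descent_count(root, rng) for _ in range(descents))
    return total / descents
```

Each descent reads only one directory per level, which is why the number of directories visited can be far smaller than in a full crawl; the price is variance, which averaging over more descents reduces.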

We would like to note, however, that Glance also has
its limitations: malicious users could potentially
construct certain ill-formed file systems that Glance
cannot effectively handle. While we plan to address
security applications in future work, our argument
that Glance is a practical system for just-in-time
analytics is based upon the fact that such file systems
rarely exist in practice. For example, Glance cannot
accurately answer aggregate queries if a large number of
folders are hundreds of levels below root. Nonetheless,
real-world file systems have far smaller depth, making
such a scenario unlikely to occur. Similarly, Glance
cannot efficiently handle cases where all files have
extremely close scores. This, however, is contradicted by
the heavy-tailed distribution observed on most metadata
attributes in real-world file systems [2].

The rest of the paper is organized as follows. Section 
2 presents the problem definition. In Sections 3 and 4, we
describe FS_Agg and FS_TopK for processing aggregate 
and top-k queries, respectively. The evaluation results 
are shown in Section 5. Section 6 reviews the related 
work, followed by the conclusion in Section 7. 


2 Problem Statement 


We now define the analytical queries, i.e., aggregate and
top-k ones, which we focus on in this paper. The examples
we list below will be used in the experimental
evaluation for testing the performance of Glance.


Aggregate Queries: In general, aggregate queries are 
of the form SELECT AGGR(T) FROM D WHERE Selec- 
tion Condition, where D is a file system or storage de- 
vice, T is the target piece of information, which may be 
a metadata attribute (e.g., size, timestamp) of a file or a 
directory, AGGR is the aggregate function (e.g., COUNT, 
SUM, AVG), and Selection Condition specifies which 
files and/or directories are of interest. First, consider a 
system administrator who is interested in the total num- 
ber of files in the system. In this case, the aggregate 
query that the administrator would like to issue can be 
expressed as: 


Q1: SELECT COUNT(files) FROM filesystem; 


Further, the administrator may be interested in know- 
ing the total size of various types of document files, e.g., 


Q2: SELECT SUM(file.size) FROM filesystem WHERE
file.extension IN {‘txt’, ‘doc’};


If the administrator wants to compute the average size 
of all exe files from user John, the query becomes: 


Q3: SELECT AVG(file.size) FROM filesystem WHERE 
file.extension = ‘exe’ AND file.owner = ‘John’; 
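
For comparison, the exact answers to Q1 and Q2 can be obtained with a full crawl, which is the expensive baseline that sampling is meant to avoid. The sketch below uses Python's standard `os.walk`; the function name, the returned tuple, and the default extension list are assumptions for this example, and it also counts the directories it reads, since every directory must be visited.

```python
import os


def crawl_aggregates(root, extensions=(".txt", ".doc")):
    """Answer Q1 and Q2 exactly by visiting every directory under root.

    Returns (file_count, total_size_of_matching_files, dirs_visited).
    """
    file_count = 0
    total_size = 0
    dirs_visited = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        dirs_visited += 1
        for name in filenames:
            file_count += 1  # Q1: COUNT(files)
            if os.path.splitext(name)[1].lower() in extensions:
                # Q2: SUM(file.size) over the selected extensions
                total_size += os.path.getsize(os.path.join(dirpath, name))
    return file_count, total_size, dirs_visited
```

Q3 follows the same pattern with an additional owner check and a running count for the average; it is omitted here because resolving owner names is platform-specific.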


Aggregate queries can also be more complex. The
following example shows a nested aggregate query for
scientific computing applications. Suppose that each
directory corresponds to a sensor and contains a number
of files corresponding to the sensor readings received at
different times. A physicist may want to count the number
of sensors that have received at least one reading during
the last 12 hours, i.e.,


Q4: SELECT COUNT(directories) FROM filesystem
WHERE EXISTS (SELECT * FROM filesystem WHERE
file.dirname = directoryname AND file.mtime BETWEEN
(now - 12 hours) AND now);


Top-k Queries: In this paper, we also consider top-k
queries of the form SELECT TOP k FROM D
WHERE Selection Condition ORDER BY T DESCENDING/ASCENDING,
where T is the scoring function
based on which the top-k files or directories are selected.
For example, a system administrator may want to select 
the 100 largest files, i.e., 


Q5: SELECT TOP 100 files FROM filesystem ORDER 
BY file.size DESCENDING; 


Another example is to find the ten most recently created
directories that were modified yesterday, i.e.,


Q6: SELECT TOP 10 directories FROM filesystem
WHERE directory.mtime BETWEEN (now - 24 hours)
AND now ORDER BY directory.ctime DESCENDING;


We note that, to approximately answer a top-k query,
one should return a list of k items, a large percentage
of which also appear in the precise top-k list.

Current operating systems and storage devices do not 
provide APIs which directly support the above-defined 
aggregate and top-k queries. The objective of just-in- 
time analytics can be stated as follows. 


Problem Statement (Objective of Just-In-Time Analyt- 
ics over File Systems): To enable the efficient approx- 
imate processing of aggregate and top-k queries over a 
file system by using the file/directory access APIs pro- 
vided by the operating system. 

To complete the problem statement, we need to determine
how to measure the efficiency and accuracy of
query processing. For the purpose of this paper, we
measure query efficiency with two metrics: 1) query
time, i.e., the runtime of query processing, and 2) query
cost, i.e., the ratio of the number of directories visited by
Glance to that of crawling the file system (i.e., the total
number of directories in the system). We assume that one
disk access is required for reading a new directory. Thus,
the query cost approximates the number of disk accesses
required by Glance, relative to a full crawl. The two
metrics, query time and cost, are positively correlated:
the higher the query cost is, the more directories the
algorithm has to sample, leading to a longer runtime.
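
For illustration only, the following Python sketch shows the crawling baseline against which query cost is defined: it answers a Q2-style query exactly by reading every directory, and reports how many directories were read. The names crawl_sum_size and query_cost are hypothetical and are not part of Glance.

```python
import os

def crawl_sum_size(root, extensions):
    """Exact answer to a Q2-style query by full crawl.

    Returns (total size of matching files, number of directories read).
    `extensions` is a set of lower-case extensions without the dot.
    """
    total, dirs_visited = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        dirs_visited += 1                      # one directory read
        for name in filenames:
            ext = os.path.splitext(name)[1].lstrip(".").lower()
            if ext in extensions:
                total += os.lstat(os.path.join(dirpath, name)).st_size
    return total, dirs_visited

def query_cost(dirs_visited_by_sampler, dirs_in_system):
    """Query cost as defined in the text: visited / total directories."""
    return dirs_visited_by_sampler / dirs_in_system
```

A sampler that reads 50 of 1,000 directories would have a query cost of query_cost(50, 1000) = 0.05 under this definition.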

While the efficiency measures are generic to both aggregate
and top-k query processing, the measures for
query accuracy are different. For aggregate queries, we
define the query accuracy as the relative error of the
approximate answer apx compared with the precise one
ans, i.e., |apx - ans|/|ans|. For top-k queries, we
define the accuracy as the percentage of items that are
common to the approximate and precise top-k lists. The
accuracy level required for approximate query processing
depends on the intended application. For example,
while scientific computing usually requires a small error,
the above-mentioned surveillance application may simply
need a ball-park figure to determine whether there is
a significant number of sensitive files in the system.
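
The two accuracy measures can be written down directly. The Python sketch below is only an illustration of the definitions above; the function names are hypothetical.

```python
def relative_error(apx, ans):
    """Accuracy measure for aggregate queries: |apx - ans| / |ans|."""
    return abs(apx - ans) / abs(ans)

def topk_accuracy(approx_list, precise_list):
    """Accuracy measure for top-k queries: the fraction of the precise
    top-k items that also appear in the approximate list."""
    common = set(approx_list) & set(precise_list)
    return len(common) / len(precise_list)
```

For instance, an approximate top-3 list that shares two items with the precise top-3 list has an accuracy of 2/3.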


3 Aggregate Query Processing


In this section, we develop FS_Agg, our algorithm
for processing aggregate queries. We first describe
FS_Agg_Basic, a vanilla algorithm which illustrates our
main idea of aggregate estimation without bias through a
random descent process within a file system. Then, we
describe two ideas to make the vanilla algorithm practical
over very large file systems: high-level crawling leverages
the special properties of a file system to reduce the
standard error of estimation, and breadth-first implementation
improves both accuracy and efficiency of query
processing. Finally, we combine all three techniques to
produce FS_Agg.


3.1 FS_Agg_Basic


A Random Descent Process: In general, the folder or- 
ganization of a file system can be considered as a tree or a 
directed acyclic graph (DAG), depending on whether the 
file system allows hard links to the same file. The random 
descent process we are about to discuss can be applied to 
both cases with little change. For ease of understanding,
we first focus on the case of a tree-like folder structure,
and then discuss a simple extension to DAGs at the end of
this subsection.
Figure 1: Random descents on a tree-like structure

Figure 1 depicts a tree structure with root corresponding
to the root directory of a file system, which we shall
use as a running example throughout the paper. One can 
see from the figure that there are two types of nodes in 
the tree: folders (directories) and files. A file is always 
a leaf node. The children of a folder consist of all sub- 
folders and files in the folder. We refer to the branches 
coming out of a folder node as subfolder-branches and 
file-branches, respectively, according to their destination 
type. We refer to a folder with no subfolder-branches
as a leaf-folder. Note that this differs from a leaf in the
tree, which can be either a file or a folder containing
neither subfolders nor files. The random descent process starts
from the root and ends at a leaf-folder. At each node,
we choose a subfolder-branch of the node uniformly at
random for further exploration. During the descent process,
we evaluate all file-branches encountered at each
node along the path, and generate an aggregate estimation
based on these file-branches.

To make the idea more concrete, consider an example
of estimating the COUNT of all files in the system.
At the beginning of random descent, we access the root
to obtain the number of its file- and subfolder-branches
$f_0$ and $s_0$, respectively, and record them as our evaluation
for the root. Then, we randomly choose a subfolder-branch
for further descent, and repeat this process until
we arrive at a folder with no subfolder. Suppose that the
numbers we recorded during such a descent process are
$f_0, s_0, f_1, s_1, \ldots, f_h, s_h$, where $s_h = 0$ because each
descent ends at a leaf-folder. We estimate the COUNT of
all files as

$$\tilde{n} = \sum_{i=0}^{h} \Bigl( f_i \cdot \prod_{j=0}^{i-1} s_j \Bigr), \qquad (1)$$

where $\prod_{j=0}^{i-1} s_j$ is assumed to be 1 when $i = 0$. Two
examples of such a random descent process are marked in
Figure 1 as red solid and blue dotted lines, respectively.
The solid descent produces $(f_0, f_1, f_2) = (2, 2, 2)$ and
$(s_0, s_1, s_2) = (4, 1, 0)$, leading to an estimation of 2 +
8 + 8 = 18. The dotted one produces $(f_0, f_1, f_2) =
(2, 0, 1)$ and $(s_0, s_1, s_2) = (4, 2, 0)$, leading to an
estimation of 2 + 0 + 8 = 10. The random descent process
can be repeated multiple times (by restarting from the
root) to produce a more accurate result (by taking the
average of estimations generated by all descents).
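
As a minimal sketch of Equation (1), the following Python code runs random descents over a hypothetical in-memory tree in which each folder is a pair (number of files, list of subfolders). The representation and the function names are assumptions made for illustration; a real implementation would read directories through the operating system instead.

```python
import random

def folder(files, subfolders=()):
    """Hypothetical in-memory folder: (number of files, list of subfolders)."""
    return (files, list(subfolders))

def descent_estimate(root, rng):
    """One random descent; returns the COUNT estimate of Equation (1)."""
    estimate = 0
    weight = 1                  # running product s_0 * ... * s_{i-1}
    node = root
    while True:
        files, subs = node
        estimate += files * weight      # contributes f_i * prod(s_j)
        if not subs:                    # leaf-folder: s_h = 0, descent ends
            return estimate
        weight *= len(subs)
        node = rng.choice(subs)         # uniform choice of a subfolder-branch

def estimate_count(root, descents, seed=0):
    """Average the estimates of several independent descents."""
    rng = random.Random(seed)
    return sum(descent_estimate(root, rng) for _ in range(descents)) / descents
```

A descent that records $(f_0, f_1, f_2) = (2, 2, 2)$ and $(s_0, s_1, s_2) = (4, 1, 0)$ accumulates 2 + 2·4 + 2·4·1 = 18 in this code, as in the solid-line example.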


Unbiasedness: Somewhat surprisingly, the estimation
produced by each random descent process is completely
unbiased, i.e., the expected value of the estimation is
exactly equal to the total number of files in the system.
To understand why, consider the total number of files at
the $i$-th level (with root being Level 0) of the tree (e.g.,
Files 1 and 2 in Figure 1 are at Level 3), denoted by $F_i$.
According to the definition of a tree, each $i$-level file
belongs to one and only one folder at Level $i - 1$. For
each $(i-1)$-level folder $v_{i-1}$, let $|v_{i-1}|$ and $p(v_{i-1})$ be
the number of ($i$-level) files in $v_{i-1}$ and the probability
for $v_{i-1}$ to be reached in the random descent process,
respectively. One can see that $|v_{i-1}|/p(v_{i-1})$ is an
unbiased estimation for $F_i$ because

$$E\left( \frac{|v_{i-1}|}{p(v_{i-1})} \right) = \sum_{v_{i-1}} \left( p(v_{i-1}) \cdot \frac{|v_{i-1}|}{p(v_{i-1})} \right) = F_i. \qquad (2)$$

With our design of the random descent process, the
probability $p(v_{i-1})$ is

$$p(v_{i-1}) = \prod_{j=0}^{i-2} \frac{1}{s_j(v_{i-1})}, \qquad (3)$$

where $s_j(v_{i-1})$ is the number of subfolder-branches of the
Level-$j$ node encountered on the path from the root to $v_{i-1}$.
Our estimation in (1) is essentially the sum of the unbiased
estimations in (2) for all $i \in [1, m]$, where $m$ is the
maximum depth of a file. Thus, the estimation generated
by the random descent is unbiased.
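
The unbiasedness argument can be checked numerically on a small tree. The sketch below is an illustration under assumptions: the tree is a hypothetical nested tuple (number of files, list of subfolders), and the expectation is computed exactly by summing, over every folder, the probability of reaching it times its contribution to Equation (1).

```python
from fractions import Fraction

def expected_estimate(node, weight=1, prob=Fraction(1)):
    """Exact expected value of the COUNT estimate over all descents.

    `weight` is the product of fan-outs on the path to `node`, and `prob`
    is the probability that a random descent reaches `node`.
    """
    files, subs = node
    total = prob * files * weight
    for child in subs:
        total += expected_estimate(child, weight * len(subs), prob / len(subs))
    return total

def true_count(node):
    """Exact number of files, obtained by crawling the whole tree."""
    files, subs = node
    return files + sum(true_count(child) for child in subs)
```

Because `prob * weight` equals 1 for every folder, each folder contributes exactly its own file count to the expectation, which is the content of Equations (2) and (3).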


Processing of Aggregate Queries: While the above example
is for estimating the COUNT of all files, the same
random descent process can be used to process queries
with other aggregate functions (e.g., SUM, AVG), with
selection conditions (e.g., COUNT all files with extension
'.JPG'), and in file systems with a DAG instead of
tree structure. We now discuss these extensions. In
particular, we shall show that the only change required for
all these extensions is in the computation of $f_i$.

SUM: For the COUNT query, we set $f_i$ to the number
of files in a folder. To process a SUM query over
a file metadata attribute (e.g., file size), we simply set
$f_i$ to the SUM of such an attribute over all files in the
folder (e.g., total size of all files). In the running example,
consider the estimation of SUM of numbers shown
on all files in Figure 1. The solid and dotted random
descents will return $(f_0, f_1, f_2) = (15, 7, 3)$ and $(15, 0, 5)$,
respectively, leading to the same estimation of 55
(15 + 7·4 + 3·4·1 and 15 + 0·4 + 5·4·2, respectively). The
unbiasedness of such an estimation follows in analogy
from the COUNT case.

AVG: A simple way to process an AVG query is to
estimate the corresponding SUM and COUNT separately,
and then compute AVG as SUM/COUNT. Note,
however, that such an estimation is no longer unbiased,
because the ratio of two unbiased estimations is not
necessarily unbiased. While an unbiased AVG estimation
may indeed be desired for certain applications, we
have proved a negative result that it is impossible to
answer an AVG query without bias unless one accesses the
file system almost as many times as crawling the
file system. We omit the detailed proof here due to
space limitations. Nonetheless, for practical purposes,
estimating AVG as SUM/COUNT is in general very accurate,
as we shall show in the experimental results.

Selection Conditions: To process a query with selection
conditions, the only change required is, again, in the
computation of $f_i$. Instead of evaluating $f_i$ over all
file-branches of a folder, to answer a conditional query, we
only evaluate $f_i$ over the files that satisfy the selection
conditions. For example, to answer a query SELECT
COUNT(*) FROM Files WHERE file.extension =
'JPG', we should set $f_i$ to the number of files under the
current folder with extension JPG. Similarly, to answer
"SUM(file_size) WHERE owner = John", we should
set $f_i$ to the SUM of sizes for all files (under the current
folder) which belong to John. Due to the computation
of $f_i$ for conditional queries, the descent process may be
terminated early to further reduce the cost of sampling.
Again consider the query condition of (owner = John).
If the random descent reaches a folder which cannot be
accessed by John, then it can terminate immediately because
any deeper descent can only return $f_i = 0$, leading
to no change in the estimation.
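
The following Python sketch illustrates how only the per-folder evaluation $f_i$ changes between COUNT, SUM, AVG and conditional queries. It is an illustration under assumptions, not the Glance code: folders are hypothetical dictionaries with "files" and "subfolders" keys, files are dictionaries of attributes, and reusing the same seed for the SUM and COUNT estimates is a choice made here for simplicity. Early termination is omitted.

```python
import random

def descent(root, evaluate, rng):
    """One random descent; evaluate(files) plays the role of f_i."""
    estimate, weight, node = 0.0, 1, root
    while True:
        estimate += evaluate(node["files"]) * weight
        subs = node["subfolders"]
        if not subs:                      # leaf-folder ends the descent
            return estimate
        weight *= len(subs)
        node = rng.choice(subs)

def estimate(root, evaluate, descents, seed=0):
    """Average of `descents` independent descents."""
    rng = random.Random(seed)
    return sum(descent(root, evaluate, rng) for _ in range(descents)) / descents

def count_where(pred):
    """f_i for COUNT with a selection condition."""
    return lambda files: sum(1 for f in files if pred(f))

def sum_where(attr, pred):
    """f_i for SUM(attr) with a selection condition."""
    return lambda files: sum(f[attr] for f in files if pred(f))

def estimate_avg(root, attr, pred, descents, seed=0):
    """AVG as SUM/COUNT; as noted in the text, this ratio may be biased."""
    total = estimate(root, sum_where(attr, pred), descents, seed)
    count = estimate(root, count_where(pred), descents, seed)
    return total / count if count else float("nan")
```

For example, `estimate(root, sum_where("size", lambda f: f["owner"] == "John"), 100)` would approximate the SUM query with the condition owner = John on such a tree.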


Extension to DAG Structure: Finally, for a file system
featuring a DAG (instead of tree) structure, we again only
need to change the computation of $f_i$. Almost all
DAG-enabled file systems (e.g., ext2, ext3, NTFS) provide a
reference count for each file which indicates the number
of links in the DAG that point to the file¹. For a file with
$r$ links, if we use the original algorithm discussed above,
then the file will be counted $r$ times in the estimation.
Thus, we should discount its impact on each estimation
with a factor of $1/r$. For example, if the query being
processed is the COUNT of all files, then we should compute
$f_i = \sum_{f \in F} 1/r(f)$, where $F$ is the set of files
under the current folder, and $r(f)$ is the number of links
to each file $f$. Similarly, to estimate the SUM of all file
sizes, we should compute $f_i = \sum_{f \in F} \mathrm{size}(f)/r(f)$,
where size($f$) is the file size of file $f$. One can see that
with this discount factor, we maintain an unbiased
estimation over a DAG file system structure.
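
On POSIX systems the link count $r(f)$ is exposed as the st_nlink field of a file's status record. The Python sketch below shows one possible way to compute the discounted $f_i$ for a single directory; the function name is hypothetical, and guarding against a zero link count is an assumption added for robustness.

```python
import os

def folder_count_and_size(path):
    """Discounted per-folder evaluations for one directory.

    Returns (f_i for COUNT, f_i for SUM of sizes), where each regular file
    is weighted by 1/r(f), r(f) being its hard-link count. Symbolic links
    are skipped and therefore have no impact on the estimation.
    """
    count, size = 0.0, 0.0
    with os.scandir(path) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue                    # skip folders and soft links
            st = os.lstat(entry.path)
            r = max(st.st_nlink, 1)         # assumed guard against r = 0
            count += 1.0 / r
            size += st.st_size / r
    return count, size
```

A file with two hard links in two different folders contributes 1/2 to the COUNT evaluation of each folder, so it is counted once in total.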


3.2 Disadvantages of FS_Agg_Basic


While the estimations generated by FS_Agg_Basic are
unbiased for SUM and COUNT queries, it is important to
understand that the error of an estimation comes not
only from bias but also from variance (i.e., standard error).
A problem of FS_Agg_Basic is that it may produce a high
estimation variance for file systems with an undesired
distribution of files, as illustrated by the following theorem:


Theorem 1. The variance of estimation produced by a
random descent on the number of $h$-level files $F_h$ is

$$\sigma(h)^2 = \left( \sum_{v \in L_{h-1}} |v|^2 \cdot \prod_{j=0}^{h-2} s_j(v) \right) - F_h^2, \qquad (4)$$

where $L_{h-1}$ is the set of all folders at Level $h - 1$, $|v|$ is
the number of files in a folder $v$, and $s_j(v)$ is the number
of subfolders for the Level-$j$ node on the path from the
root to $v$.


¹ In ext2 and ext3, for example, the system provides the number of
hard links for each file. Note that for soft links, we can simply ignore
them during the descent process. Thus, they bear no impact on the final
estimation.

Proof. Consider an $(h-1)$-level folder $v$. If the random
descent reaches $v$, then the estimation it produces
for the number of $h$-level files is $|v|/p(v)$, where $p(v)$ is
the probability for the random descent to reach $v$. Let
$\delta(h)$ be the probability that a random descent terminates
early before reaching a Level-$(h-1)$ folder. Since each
random descent reaches at most one Level-$(h-1)$ folder,
the estimation variance for $F_h$ is

$$\sigma(h)^2 = \delta(h) \cdot F_h^2 + \sum_{v \in L_{h-1}} p(v) \cdot \left( \frac{|v|}{p(v)} - F_h \right)^2 \qquad (5)$$

$$= \delta(h) \cdot F_h^2 + \sum_{v \in L_{h-1}} \left( \frac{|v|^2}{p(v)} - 2 |v| F_h + p(v) \cdot F_h^2 \right) \qquad (6)$$

$$= \left( \sum_{v \in L_{h-1}} \frac{|v|^2}{p(v)} \right) - F_h^2, \qquad (7)$$

where the last step uses $\sum_{v \in L_{h-1}} |v| = F_h$ and
$\sum_{v \in L_{h-1}} p(v) = 1 - \delta(h)$.
Since $p(v) = 1/\prod_{j=0}^{h-2} s_j(v)$, the theorem is proved.
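
As a sanity check of Theorem 1, the following Python sketch computes the variance for Level-$h$ files on a small hypothetical tree in two ways: by the closed form of the theorem, and directly from the definition of variance (an early-terminated descent estimates 0 files at Level $h$). The tree representation (number of files, list of subfolders) and the function names are assumptions for illustration.

```python
from fractions import Fraction

def level_folders(node, depth=0, fanout_product=1):
    """Yield (depth, files, product of fan-outs on the path) for each folder."""
    files, subs = node
    yield depth, files, fanout_product
    for child in subs:
        yield from level_folders(child, depth + 1, fanout_product * len(subs))

def variance_by_theorem(root, h):
    """Closed form: sum of |v|^2 * prod s_j(v) over Level-(h-1) folders, minus F_h^2."""
    folders = [(n, w) for d, n, w in level_folders(root) if d == h - 1]
    f_h = sum(n for n, _ in folders)
    return sum(n * n * w for n, w in folders) - f_h ** 2

def variance_by_definition(root, h):
    """Expected squared deviation of the per-descent estimate from F_h."""
    folders = [(n, w) for d, n, w in level_folders(root) if d == h - 1]
    f_h = sum(n for n, _ in folders)
    reach = sum(Fraction(1, w) for _, w in folders)     # 1 - delta(h)
    var = (1 - reach) * f_h ** 2                         # early termination
    var += sum(Fraction(1, w) * (n * w - f_h) ** 2 for n, w in folders)
    return var
```

Plugging a single folder with $|v| = 1$ and a fan-out product of 72 into the closed form gives 72 - 1 = 71.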





One can see from the theorem that the existence of
two types of folders may lead to an extremely high
estimation variance: One type is high-level leaf-folders (i.e.,
"shallow" folders with no subfolders). Folder c in Figure 1
is an example. To understand why such folders
lead to a high variance, consider (7) in the proof of
Theorem 1. Note that for a large $h$, a high-level leaf-folder
(above Level-$(h-1)$) reduces $\sum_{v \in L_{h-1}} p(v)$ because
once a random descent reaches such a folder, it will not
continue to retrieve any file in Level-$h$ (e.g., Folder c in
Figure 1 stops further descents for $h = 3$ or 4). As a
result, the first item in (7) becomes higher, increasing the
estimation variance. For example, after removing Folder
c from Figure 1, the estimation variance for the number
of files on Level 3 can be reduced from 24 to 9.

The other type of "ill-conditioned" folders are those
deep-level folders which reside at much lower levels than
others (i.e., with an extremely large $h$). An example is
Folder j in Figure 1. The key problem arising from such
a folder is that the probability for it to be selected is usually
extremely small, leading to an estimation much larger
than the real value if the folder happens to be selected. As
shown in Theorem 1, a larger $h$ leads to a higher $\prod s_j(v)$,
which in turn leads to a higher variance. For example,
Folder j in Figure 1 has $\prod s_j(v) = 4 \times 2 \times 3 \times 3 = 72$,
leading to an estimation variance of 72 - 1 = 71 for the
number of files on Level 5 (which has a real value of 1).


3.3 FS_Agg 


To reduce the estimation variance, we propose high-level 
crawling and breadth-first descent to address the two 
above-described problems on estimation variance, high- 
level leaf-folders and deep-level folders, respectively. 
Also, we shall discuss how the variance generated by 
FS_Agg can be estimated in practice, effectively produc- 
ing a confidence interval for the aggregate query answer. 


FAST ’11: 9th USENIX Conference on File and Storage Technologies 


High-Level Crawling is designed to eliminate the negative
impact of high-level leaf-folders on estimation variance.
The main idea of high-level crawling is to access
all folders in the highest $i$ levels of the tree by following
all subfolder-branches of folders accessed on or above
Level-$(i-1)$. Then, the final estimation becomes an
aggregate of two components: the precise value over the
highest $i$ levels and the estimated value (produced by random
descents) over files below Level-$i$. One can see from
the design of high-level crawling that now leaf-folders in
the first $i$ levels no longer reduce $p(v)$ for folders $v$ below
Level-$i$ (and therefore no longer adversely affect the
estimation variance). Formally, we have the following
theorem² which demonstrates the effectiveness of high-level
crawling on reducing the estimation variance:


Theorem 2. If $r_0$ out of $r$ folders crawled from the first $i$
levels are leaf-folders, then the estimation variance produced
by a random descent for the number of Level-$h$
files $F_h$ satisfies

$$\sigma_{\mathrm{HLC}}(h)^2 \le \frac{(r - r_0) \cdot \sigma(h)^2 - r_0 \cdot F_h^2}{r}. \qquad (8)$$

According to this theorem, if we apply high-level
crawling over the first level in Figure 1, then the estimation
variance for the number of files on Level 3 is at
most (3·24 - 1·36)/4 = 9. Recall from Section 3.2 that
the variance of estimation after removing Folder c (the
only leaf-folder at the first level) is exactly 9. Thus, the
bound in Theorem 2 is tight in this case.
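
A possible way to combine the two components is sketched below in Python. This is an illustration under stated assumptions, not the Glance implementation: the tree is a hypothetical nested tuple (number of files, list of subfolders), `levels` counts tree levels starting at the root, and each descent below the crawled region is assumed to start from a frontier folder chosen uniformly at random.

```python
import random

def crawl_top_levels(root, levels):
    """Count files exactly in folders at depth < `levels`.

    Returns (exact count, frontier), where the frontier is the list of
    folders at depth == `levels` that have not been read yet.
    """
    exact, frontier = 0, [root]
    for _ in range(levels):
        next_frontier = []
        for files, subs in frontier:
            exact += files
            next_frontier.extend(subs)
        frontier = next_frontier
    return exact, frontier

def descent_below(frontier, rng):
    """One random descent starting from a uniformly chosen frontier folder."""
    if not frontier:
        return 0
    weight = len(frontier)          # inverse probability of the first choice
    node = rng.choice(frontier)
    estimate = 0
    while True:
        files, subs = node
        estimate += files * weight
        if not subs:
            return estimate
        weight *= len(subs)
        node = rng.choice(subs)

def estimate_count_hlc(root, levels, descents, seed=0):
    """Exact count over the crawled levels plus a sampled estimate below them."""
    rng = random.Random(seed)
    exact, frontier = crawl_top_levels(root, levels)
    sampled = sum(descent_below(frontier, rng) for _ in range(descents))
    return exact + sampled / descents
```

With this starting rule, a leaf-folder inside the crawled levels never ends a descent, because descents begin only at folders on the frontier.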


Breadth-First Descent is designed to bring two advantages
over FS_Agg_Basic: variance reduction and runtime
improvement, which we explain as follows.
Variance Reduction: Breadth-first descent starts from the
root of the tree. Then, at any level of the tree, it generates
a set of folders to access at the next level by randomly
selecting from subfolders of all folders it accesses at the
current level. Note that any random selection process
would work: as long as we know the probability for a
folder to be selected, we can answer aggregate queries
without bias in the same way as the original random
descent process. For example, to COUNT the number of
all files in the system, an unbiased estimation of the total
number of files at Level $i$ is the SUM of $|v_{i-1}|/p(v_{i-1})$
for all Level-$(i-1)$ folders $v_{i-1}$ accessed by the
breadth-first implementation, where $|v_{i-1}|$ and $p(v_{i-1})$ are the
number of file-branches and the probability of selection
for $v_{i-1}$, respectively.

² In the rest of the paper, we do not include the proofs of
theorems due to space limitations.

We use the following random selection process in
Glance: Consider a folder accessed at the current level
which has $n_0$ subfolders. From these $n_0$ subfolders,
we sample without replacement
$\min(n_0, \max(p_{sel} \cdot n_0, s_{min}))$ ones for access at the
next level. Here $p_{sel} \in (0, 1]$ (where sel stands for selection)
represents the probability with which a subfolder will be
selected for sampling, and $s_{min} \ge 1$ states the minimum
number of subfolders that will be sampled. Both $p_{sel}$ and
$s_{min}$ are user-defined parameters, the settings for which
we shall further discuss in the experiments section based
on characteristics of real-world file systems.
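
The selection rule above can be sketched in Python as follows. This is a hedged illustration rather than the Glance implementation: the tree is a hypothetical nested tuple (number of files, list of subfolders), rounding $p_{sel} \cdot n_0$ up to an integer is an assumption (the text does not specify the rounding), and the selection probability of each subfolder is taken to be the sampled fraction $k/n_0$.

```python
import math
import random

def select_subfolders(subfolders, p_sel, s_min, rng):
    """Sample min(n0, max(p_sel * n0, s_min)) subfolders without replacement.

    Returns the chosen subfolders and the probability k / n0 with which
    each individual subfolder is included in the sample.
    """
    n0 = len(subfolders)
    k = min(n0, max(math.ceil(p_sel * n0), s_min))   # rounding up is assumed
    return rng.sample(subfolders, k), k / n0

def breadth_first_count(root, p_sel=0.5, s_min=3, seed=0):
    """COUNT estimate: sum of |v| / p(v) over all folders accessed."""
    rng = random.Random(seed)
    estimate = 0.0
    level = [(root, 1.0)]           # (folder, probability of being accessed)
    while level:
        next_level = []
        for (files, subs), prob in level:
            estimate += files / prob
            if subs:
                chosen, ratio = select_subfolders(subs, p_sel, s_min, rng)
                next_level.extend((child, prob * ratio) for child in chosen)
        level = next_level
    return estimate
```

Each folder is read at most once per call, and with $p_{sel} = 1$ the procedure degenerates into a full crawl that returns the exact count.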

Compared with the original random descent design,
this breadth-first random selection process significantly
increases the selection probability for a deep folder.
Recall that with the original design, while drilling down
one level in the tree, the selection probability can
decrease rapidly by a factor of the fan-out (i.e., the number
of subfolders) of the current folder. With breadth-first
descent, on the other hand, the decrease is limited to at
most a factor of $1/p_{sel}$, which can be much smaller than
the fan-out when $p_{sel}$ is reasonably high (e.g., 0.5, as we
shall suggest in the experiments section). As a result,
the estimation generated by a deep folder becomes much
smaller. Formally, we have the following theorem.


Theorem 3. With breadth-first descent, the variance of
estimation on the number of $h$-level files $F_h$ satisfies

$$\sigma_{\mathrm{BFS}}(h)^2 \le \left( \sum_{v \in L_{h-1}} \frac{|v|^2}{p_{sel}^{\,h-1}} \right) - F_h^2. \qquad (9)$$


One can see from a comparison with Theorem 1 that
the factor of $\prod s_j(v)$ in the original variance, which can
grow to an extremely large value, is now replaced by
$1/p_{sel}^{\,h-1}$, which can be better controlled by the Glance
system to remain at a low level even when $h$ is large.
Runtime Improvement: In the original design of
FS_Agg_Basic, random descent has to be performed multiple
times to reduce the estimation variance. Such multiple
descents are very likely to access the same folders,
especially the high-level ones. While one can leverage
the history of hard-drive accesses by caching all historic
accesses in memory, such repeated accesses can
still take significant CPU time for in-memory lookups.
The breadth-first design, on the other hand, ensures that
each folder is accessed at most once, reducing the runtime
overhead of the Glance system.


Variance Produced by FS_Agg: An important issue
for applying FS_Agg in practice is how one can estimate
the error of approximate query answers it produces.
Since FS_Agg generates unbiased answers for SUM and
COUNT queries, the key enabling factor for error estimation
here is an accurate computation of the variance. One
can see from Theorem 3 that variance depends on the
specific structure of the file system, in particular the
distribution of selection probability $p_{sel}$ for different folders.
Since our sampling-based algorithm does not have
a global view of the hierarchical structure, it cannot
precisely compute the variance.

Fortunately, the variance can still be accurately
approximated in practice. To understand how, consider first
the depth-first descents used in FS_Agg_Basic. Each
descent returns an independent aggregate estimation, while
the average for multiple descents becomes the final
approximate query answer. Let $q_1, \ldots, q_h$ be the independent
estimations and $\tilde{q} = (\sum q_i)/h$ be the final answer
(here $h$ denotes the number of descents).
A simple method of variance approximation is to compute
$\mathrm{var}(q_1, \ldots, q_h)/h$, where var(·) is the variance of
independent estimations returned by the descents. Note
that if we consider a population consisting of estimations
generated by all possible descents, then $q_1, \ldots, q_h$ form
a sample of the population. As such, the variance computation
is approximating the population variance by sample
variance, which are asymptotically equal (for an
increasing number of descents).
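
The variance approximation translates into a few lines of Python using the standard statistics module. The sketch below is illustrative: the function name is hypothetical, and turning the variance into an interval with a normal-approximation multiplier z = 1.96 (roughly 95% coverage) is an assumption that the text does not prescribe.

```python
import math
import statistics

def answer_with_interval(estimates, z=1.96):
    """Final answer, its approximate variance, and a confidence interval.

    `estimates` holds the per-descent estimations q_1..q_h (h >= 2).
    The variance of the averaged answer is approximated by
    var(q_1..q_h) / h, using the sample variance.
    """
    h = len(estimates)
    mean = statistics.fmean(estimates)
    var_of_mean = statistics.variance(estimates) / h
    half_width = z * math.sqrt(var_of_mean)
    return mean, var_of_mean, (mean - half_width, mean + half_width)
```

For the two descents of the running example (estimations 18 and 10), the answer is 14 with an approximate variance of 32/2 = 16; two descents are of course far too few for the interval to be meaningful.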

We conducted extensive experiments described in Section 5
to verify the accuracy of such an approximation.
Figure 2 shows two examples for counting the total number
of files in an NTFS and a Plan 9 file system, respectively.
Observe from the figure that the real variance
oscillates in the beginning of descents. For example,
we observe at least one spike on each file system
within the first 100 descents. Such a spike occurs when
one descent happens to end with a deep-level file which
returns an extremely large estimation, and is very likely
to happen with our sampling-based technique. Nonetheless,
note that the real variance converges to a small
value when the number of descents is sufficiently large
(e.g., > 400). Also note that for both file systems, after
a small number of descents (about 50), the sample variance
$\mathrm{var}(q_1, \ldots, q_h)/h$ becomes an extremely accurate
approximation for the real (population) variance (the two
curves overlap in Figure 2), even during the spikes. One
can thereby derive an accurate confidence interval for the
query answer produced by FS_Agg_Basic.
Figure 2: Variance approximation for (a) an NTFS file
system and (b) a Plan 9 system. Real and sample variances
overlap when the number of descents is sufficiently
large.


While FS_Agg no longer performs individual depth-first
descents, the idea of using sample variance to
approximate population variance still applies. In particular,
note that for any given level, say Level-$i$, of the
tree structure, each folder randomly chosen by FS_Agg
at Level-$(i-1)$ produces an independent, unbiased
estimation for the SUM or COUNT aggregate over all files
in Level-$i$. Thus, the variance for an aggregate query
answer over Level-$i$ can be approximated based on the
variance of estimations generated by the individual folders.
The variance of the final SUM or COUNT query answer
(over the entire file system) can then be approximated by
the SUM of variances for all levels.


4 Top-k Query Processing 


Recall that for a given file system, a top-k query is
defined by two elements: the scoring function and the
selection conditions. Without loss of generality, we
consider a top-k query which selects k files (directories) with
the highest scores. For the sake of simplicity, we focus
on top-k queries without selection conditions, and
consider a tree-like structure of the file system. The
extensions to top-k queries with selection conditions and file
systems with DAG structures follow in analogy from the
same extensions for FS_Agg.


4.1 Main Idea 


A simple way to answer a top-k query is to access ev- 
ery directory to find the k files with the highest scores. 
The objective of FS_TopK is to generate an approximate 
top-k list with far fewer hard-drive accesses. To do so, 
FS_TopK consists of the following three steps. We shall 
describe the details of these steps in the next subsection. 


1. A Lower-Bound Estimation: The first step uses a 
random descent similar to FS_Agg to generate an 
approximate lower bound on the k-th highest score 
over the entire file system (i.e., among files that sat- 
isfy the selection conditions specified in the query). 

2. Highest-Score Estimations and Tree Pruning: In the 
second step, we prune the tree structure of the file 
system according to the lower bound generated in 
Step 1. In particular, for each subtree, we use the 
results of descents to generate an upper-bound esti- 
mate on the highest score of all files in the subtree. 
If the estimation is smaller than the lower bound 
from Step 1, we remove the subtree from search 
space because it is unlikely to contain a top-k file. 
Note that in order for such a pruning process to have 
a low false negative rate - i.e., not to falsely remove 
a large number of real top-k files, a key assumption 
we are making here is the “locality” of scores - i.e., 
files with similar scores are likely to co-locate in the 
same directory or close by in the tree structure. Intuitively,
the files in a directory are likely to have
similar creation and update times. In some cases
(e.g., images in the "My Pictures" directory, and
outputs from a simulation program), the files will 
likely have similar sizes too. Note that the strength 
of this locality is heavily dependent on the type of 
the query and the semantics of the file system on 
which the query is running. We plan to investigate 
this issue as part of the future work. 

3. Crawling of the Selected Tree: Finally, we crawl the 
remaining search space - i.e., the selected tree - by 
accessing every folder in it to locate the top-k files 
as the query answer. Such an answer is approximate
because some real top-k files might exist in the pruned
(non-selected) subtrees, albeit with a small probability, as
we shall show in the experimental results.


In the running example, consider a query for the top-3
files with the highest numbers shown in Figure 1. Suppose
that Step 1 generates a (conservative) lower bound
of 8, and the highest scores estimated in Step 2 for
subtrees with roots a, c, d, and m are 5, -1 (i.e., no file),
7, and 15, respectively; the details of these estimations
will be discussed shortly. Then, the pruning step will
remove the subtrees with roots a, c, and d, because their
estimated highest scores are lower than the lower bound
of 8. Thus, the final crawling step only needs to access
the subtree with root m. In this example, the algorithm
would return the files identified as 8, 9, and 10, locating
two top-3 files while crawling only a small fraction of the
tree. Note that the file with the highest number 11 could
not be located here because the pruning step removes the
subtree with root d.


4.2 Detailed Design 


The design of FS_TopK is built upon a hypothesis that
the highest scores estimated in Step 2, when compared
with the lower bound estimated in Step 1, can prune a
large portion of the tree, significantly reducing the overhead
of crawling in Step 3. In the following, we first
describe the estimations of the lower bound and the highest
scores in Steps 1 and 2, and then discuss the validity of
the hypothesis for various types of scoring functions.
Both estimations in the two steps can be made from
the order statistics [20] of files retrieved by the random
descent process in FS_Agg. The reason is that both
estimations are essentially on the order statistics of the
population (i.e., all files in the system): the lower bound in
Step 1 is the k-th largest order statistic of all files, while
the highest scores are the largest order statistics of the
subtrees. We refer readers to [20] for details of how the
order statistics of a sample can be used to estimate those
of the population and how accurate such an estimation is.
While sampling for order statistics is a problem in its
own right, for the purpose of this paper, we consider the
following simple approach which, according to our
experiments over real-world file systems, suffices for
answering top-k queries accurately and efficiently over
almost all tested systems: For the lower-bound estimation
in Step 1, we use the sample quantile as an estimation
of the population quantile. For example, to estimate the
100-th largest score of a system with 10,000 files, we use
the largest score of a 100-file sample as an estimation.
Our tests show that for many practical scoring functions
(which usually have a positive skew, as we shall discuss
below), the result serves as a conservative lower bound
desired by FS_TopK. For the highest-score estimation
in Step 2, we simply compute $\gamma \cdot \max(\text{sample scores})$,
where $\gamma$ is a constant correction parameter. The setting
of $\gamma$ captures a tradeoff between the crawling cost and
the chances of finding top-k files: when a larger $\gamma$ is
selected, fewer subtrees are likely to be removed.
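
The two estimations and the pruning rule can be sketched in Python as follows. This is a simplified illustration, not the FS_TopK implementation: the function names are hypothetical, the population size is assumed to be known (in practice it could itself be an estimate), mapping rank k to sample rank ceil(k·n/N) is one assumed reading of "sample quantile", and subtrees with no sampled file are always pruned.

```python
import math

def lower_bound(sample_scores, k, population_size):
    """Sample-quantile estimate of the k-th largest score in the population."""
    ranked = sorted(sample_scores, reverse=True)
    # Rank k among N files corresponds to rank ceil(k * n / N) in a sample of n.
    rank = max(1, math.ceil(k * len(ranked) / population_size))
    return ranked[min(rank, len(ranked)) - 1]

def upper_bound(subtree_sample_scores, gamma):
    """Highest-score estimate for a subtree: gamma * max(sample scores)."""
    if not subtree_sample_scores:
        return float("-inf")            # no file sampled in this subtree
    return gamma * max(subtree_sample_scores)

def prune(subtree_samples, bound, gamma):
    """Return the subtrees kept for crawling in Step 3.

    `subtree_samples` maps a subtree name to the scores sampled from it;
    a subtree is removed when its estimate is smaller than `bound`.
    """
    return [name for name, scores in subtree_samples.items()
            if upper_bound(scores, gamma) >= bound]
```

With samples whose maxima are 5, 7 and 15 for subtrees a, d and m and a lower bound of 8, $\gamma = 1$ keeps only m, whereas $\gamma = 1.2$ also keeps d, which shows how a larger $\gamma$ trades crawling cost for a better chance of finding top-k files.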


We now discuss when the hypothesis of heavy pruning
is valid and when it is not. Ideally, two conditions
should be satisfied for the hypothesis to hold: (1) If a
subtree includes a top-k file, then it should include a
(relatively) large number of highly scored files, in order for
the sampling process (in Step 2) to capture one (and to
thereby produce a highest-score estimation that surpasses
the lower bound) with a small query cost. And (2) on the
other hand, most subtrees (which do not include a top-k
file) should have a maximum score significantly lower
than the k-th highest score. This way, a large number
of subtrees can be pruned to improve the efficiency of
top-k query processing. In general, one can easily
construct a scoring function that satisfies both or neither of
the above two conditions. We focus on a special class
of scoring functions: those following a heavy-tailed
distribution (i.e., its cumulative distribution function $F(\cdot)$
satisfies $\lim_{x \to \infty} e^{\lambda x} (1 - F(x)) = \infty$ for all $\lambda > 0$).
Existing studies on real-world file system traces showed
that many file/directory metadata attributes, which are
commonly used as scoring functions, belong to this
category [2]. For example, the distributions of file size, last
modified time, creation time, etc., in the entire file system
or in a particular subtree are likely to have a heavy
tail on one or both extremes of the distribution.


A key intuition here is that scoring functions defined 
as such attribute values (e.g., finding the top-k files with 
the maximum sizes or the latest modified time) usually 
satisfy both conditions: First, because of the long tail, 
a subtree which includes a top-k scored file is likely to 
include many other highly scored files as well. Second, 
since the vast majority of subtrees have their maximum 
scores significantly smaller than the top-k lower bound, 
the pruning process is likely to be effective with such a 
scoring function. 


We would also like to point out an “opposite” class of 
scoring functions for which the pruning process is not ef- 
fective: the inverse of the above scoring functions - e.g., 
the top-k files with the smallest sizes. Such a scoring 
function, when used in a top-k query, selects k files from
the “crowded” light-tailed side of the distribution. The 
pruning is less likely to be effective because many other 
folders may have files with similar scores, violating the 
second condition stated above. Fortunately, asking for 
top-k smallest files is not particularly useful in practice, 
also because of the fact that it selects from the crowded 
side - e.g., the answer is likely to be a large number of 
empty files. 


5 Implementation and Evaluation 


5.1 Implementation 


We implemented Glance, including all three algorithms 
(FS_Agg_Basic, FS_Agg and FS_TopK) in 1,600 lines of
C code in Linux. We also built and used a simulator in 
Matlab to complete a large number of tests within a short 
period of time. While the implementation was built upon 
the ext3 file system, the algorithms are generic to any 
hierarchical file system and the current implementation 
can be easily ported to other platforms, e.g., Windows 
and Mac OS. FS_Agg_Basic has only one parameter: the
number of descents. FS_Agg has three parameters: the
selection probability p_sel, the minimum number of selections
s_min, and the number of (highest) levels for crawling
h. Our default parameter settings are p_sel = 50%, s_min =
3, and h = 4. We also tested with other combinations of
parameter settings. FS_TopK has one additional parameter,
the (estimation) enlargement ratio γ. The setting of
γ depends on the query to be answered, which shall be
explained later.


5.2 Experiment Setup 


Test Platform: We ran all experiments on Linux machines
with an Intel Core 2 Duo processor, 4GB RAM, and a
1TB Samsung 7200RPM hard drive. Unless otherwise
specified, we ran each experiment five times and
reported the averages.

Windows File Systems: The Microsoft traces [2] include
snapshots of around 63,000 file systems, 80%
of which are NTFS and the rest FAT. To test Glance
over file systems with a wide range of sizes, we first
selected from the traces two file systems, m100K and
m1M (the first ‘m’ stands for Microsoft trace), which
are the largest file systems with fewer than 100K and 1M
files, respectively. Specifically, m100K has 99,985 files
and 16,013 directories, and m1M has 998,472 files and
106,892 directories. We also tested the largest system in
the trace, m10M, which has the maximum number of
files (9,496,510) and directories (789,097). We put together
the largest 33 file systems in the trace to obtain
m100M, which contains over 100M files and 7M directories.
In order to evaluate next-generation billion-level file
systems for which there are no available traces, we chose
to replicate m100M 10 times to create m1B with
over 1 billion files and 70M directories. While a similar
scale-up approach has been used in the literature [26, 49],
we would like to note that the duplication-filled system
may exhibit different properties from a real system with
100M or 1B files. As part of future work, we shall evaluate
our techniques in real-world billion-level file systems.

Plan 9 File Systems: Plan 9 is a distributed file system
developed and used at Bell Labs [41, 42]. We replayed
the trace data collected on two central file servers,
bootes and emelie, to obtain two file systems, pb (for
bootes) and pe (for emelie), each of which has over 2M
files and 70-80K directories.

NFS: Here we used the Harvard trace [21, 45] that consists
of workloads on NFS servers. The replay of a one-day
trace created about 1,500 directories and 20K files.
Again, we scaled up the one-day system to a larger
file system nfs (2.3M files and 137K folders), using the
above-mentioned approach.

Synthetic File Systems: To conduct a more compre- 
hensive set of experiments on file systems with differ- 
ent file and directory counts, we used Impressions [1] 
to generate a set of synthetic file systems. By adjust- 
ing the file count and the (expected) number of files per 
directory, we used Impressions to generate three file systems,
i10K, i100K, and i1M (here ‘i’ stands for Impressions),
with file counts 10K, 100K, and 1M, and directory
counts 1K, 10K, and 100K, respectively.


5.3 Aggregate Queries 


We first considered Q1 discussed in Section 2, i.e., the to- 
tal number of files in the system. To provide a more intu- 
itive understanding of query accuracy (than the arguably 
abstract measure of relative error), we used the Matlab 
simulator (for quick simulation) to generate a box plot 
(Figure 3(a)) of estimations and overhead produced by 
Glance on Q1 over five file systems, m100K to m10M, 
pb and pe. Recall that, as defined in Section 2, the query
cost (in Figure 3(b) and the following figures) is the ratio 
between the number of directories visited by Glance and 
that by file-system crawling. One can see that Glance 
consistently generates accurate query answers, e.g., for 
m10M, sampling 30% of directories produces an answer 
with 2% average error. While there are outliers, the num- 
ber of outliers is small and their errors never exceed 7%. 

Figure 3: Box plots of accuracy and cost of 100 trials

We also evaluated Glance with other file systems and
varied the input parameter settings. This test was conducted
on the Linux and ext3 implementation, and so
were the following tests on aggregate queries. In this
test, we varied the minimum number of selections s_min
from 3 to 6 and the number of crawled levels h from 3 to 5,
and set the selection probability to p_sel = 50% (i.e., half
of the subfolders are selected if their number exceeds
s_min). Figure 4 shows the query accuracy and cost
on the eleven file systems we tested. For all file systems,
Glance was able to produce very accurate answers (with
<10% relative error) when crawling four or more levels
(i.e., h ≥ 4). Also note from Figure 4 that the performance
of Glance is less dependent on the type of the file
system than its size - it achieves over 90% accuracy for
NFS, Plan 9, and NTFS (m10M to m1B). Depending on
the individual file system, the cost ranges from less than
12% of crawling for large systems with 1B files to 80%
for the small 100K system. The algorithm scales very
well to large file systems, e.g., m100M and m1B - the
relative error is only 1-3% when Glance accesses only
10-20% of all directories. For m1B, the combination of
p_sel = 50%, s_min = 3 and h = 4 produces 99% accuracy
with very little cost (12%).

Figure 5: Query accuracy vs. run time in seconds. Three
points of each line (from left to right) represent h of 3, 4,
and 5, respectively.

Figure 5 illustrates the runtimes (in seconds) for aggregate
queries. The absolute runtime depends heavily
on the size of the file system, e.g., seconds for m100K,
several minutes for nfs (2.3M files), and 1.2 hours for
m100M (not shown in the figure). Note that in this paper
we only used a single hard drive; parallel I/O across multiple
hard drives (e.g., RAID) would be able to utilize the aggregate
bandwidth to further improve the performance. As
the value of h increases, the query runs slightly longer
but the accuracy improves by about 10% for pb and 20%
for pe. The accuracy improvements for m10M and nfs
are smaller. The value of s_min is 3 in this test.

Figure 4: Accuracy and cost of aggregate queries under different settings of the input parameters. Label 3-3 stands for
h of 3 and s_min of 3, 3-6 for h of 3 and s_min of 6, etc., while p_sel is 50% for all cases.

Figure 6: Accuracy and cost of queries


We also considered other aggregate queries with vari- 
ous aggregate functions and with/without selection con- 
ditions. Figure 6(a) presents the accuracy and cost of 
evaluating the SUM and AVG of file sizes for all files 
in the system, while Figure 6(b) depicts the same for 
exe files. We included in both figures the accuracy of 
COUNT because AVG is calculated as SUM/COUNT. 
Both SUM and AVG queries receive very accurate an- 
swers, e.g., only 2% relative error for m10M with or 
without the selection condition of ‘.exe’. The query 
costs are moderate for large systems - 30% for m1M and
m10M (higher for the small system m100K). We also
tested SUM and AVG queries with other selection conditions
(e.g., file type = ‘.dll’) and found similar results.
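
To make the relationship AVG = SUM/COUNT concrete, the sketch below shows one way a basic random-descent estimator could be written. It is an assumed, simplified version (a uniform choice among subdirectories with inverse-probability weighting) and not the Glance implementation; the `Dir` class and the function names are hypothetical.

```python
import random


class Dir:
    """A toy directory: a list of file sizes plus a list of subdirectories."""

    def __init__(self, file_sizes=(), subdirs=()):
        self.file_sizes = list(file_sizes)
        self.subdirs = list(subdirs)


def random_descent(root, rng):
    """One root-to-leaf descent; returns (count_estimate, sum_estimate).

    Each visited directory's contribution is divided by the probability of
    reaching it, so that both estimates are unbiased in expectation.
    """
    count_est, sum_est = 0.0, 0.0
    prob = 1.0  # probability of having reached the current directory
    node = root
    while True:
        count_est += len(node.file_sizes) / prob
        sum_est += sum(node.file_sizes) / prob
        if not node.subdirs:
            return count_est, sum_est
        prob /= len(node.subdirs)  # uniform choice among subdirectories
        node = rng.choice(node.subdirs)


def estimate_avg(root, descents, seed=0):
    """AVG derived as SUM / COUNT, each averaged over several descents."""
    rng = random.Random(seed)
    results = [random_descent(root, rng) for _ in range(descents)]
    count = sum(c for c, _ in results) / descents
    total = sum(s for _, s in results) / descents
    return total / count if count else 0.0
```

On a perfectly symmetric tree every descent returns the exact COUNT and SUM; on skewed trees individual descents vary, which is why several descents are averaged.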


5.4 Top-k Queries 


To evaluate the performance of FS_TopK, we considered 
both Q5 and Q6 discussed in Section 2. For Q5, i.e., the k
largest files, we tested Glance over five file systems, with
k being 50 or 100. One can see from the results depicted
in Figure 7 that, in all but one case (m1M), Glance is
capable of locating at least 50% of all top-k files (for pb,
more than 95% are located). Meanwhile, the cost is as
little as 4% of crawling (for m10M). Figure 8 presents
the runtimes of the top-k queries, where one can see that,
similar to aggregate queries, the runtime is correlated with
the size of the file system - the queries take only a few
seconds for small file systems, and up to ten minutes for
large systems (e.g., m10M).



































[Figure 7 plot omitted: two panels with values from 0.0 to 1.0;
x-axis: File Systems (m100K, m1M, m10M, pb, pe)]


Figure 7: Accuracy and cost of Top-k queries on file size


Figure 9 presents the query accuracy and cost for Top-k
queries on file size, when γ varies from 1, 5, 10, to
100,000. The trend is clear: the query cost increases as
γ does, because a higher value of γ scales the highest-score
estimation up to a larger degree, that is, it causes Glance
to crawl a larger portion of the file system. Fortunately, a
moderate γ of 5 or 10 presents a good tradeoff point, achieving
a reasonable accuracy without incurring too much cost.

[Figure 8 plot omitted: x-axis: Time (sec), 0 to 600; y-axis: accuracy, 0.4 to 1.0;
series: m100K, m1M, m10M, pb, pe]

Figure 8: Top-k query accuracy vs. run time in seconds.
The first point of each line stands for top-50 and the second
for top-100.

We also tested Q6, i.e., the k most recently modified
files over m100K, m1M, and pb. The results are shown
in Figure 10. One can see that Glance is capable of locating
more than 90% of top-k files for pb, and about 60%
for m100K and m1M. The cost, meanwhile, is 28% of
crawling for m100K, 1% for m1M, and 36% for pb.


6 Related Work


Metadata query on file systems: Prior research on file-system
metadata query [26, 32] has focused extensively
on databases, which utilize indexes on file metadata.
However, the results [26, 31, 32] revealed the inefficiency
of this paradigm due to metadata locality and distribution
skewness in large file systems. To solve this problem,
Spyglass [30, 32], SmartStore [26], and Magellan [31]
utilize multi-dimensional structures (e.g., K-D trees and
R-trees) to build indexes upon subtree partitions or semantic
groups. SmartStore attempts to reorganize the
files based on their metadata semantics. Conversely,
Glance avoids any file-specific optimizations, aiming instead
to maintain file system agnosticism. It works seamlessly
with the tree structure of a file system and avoids
the time and space overheads of building and maintaining
the metadata indexes.


Comparison with Database Sampling: Traditionally 
database sampling has been used to reduce the cost of 
retrieving data from a DBMS. Random sampling mech- 
anisms have been extensively studied [4, 6,9, 12, 14, 15, 
22,34]. Applications of random sampling include esti- 
mation methodologies for histograms and approximate 
query processing (see tutorial in [15]). However, these 
techniques do not apply when there is no direct random 
access to all elements of interest - e.g., in a file system, 
where there is no complete list of all files/directories. 

Another particularly relevant topic is the sampling of 
hidden web databases [8, 24, 25,28], for which a random 
descent process has been used to construct queries is- 
sued over the web interfaces of these databases [16-19]. 
While both these techniques and Glance use random de- 
scents, a unique challenge for sampling a file system is its 
much more complex distribution of files. If we consider 
a hidden database in the context of a file system, then 
all files (i.e., tuples) appear under folders with no sub- 
folders. Thus, the complex distribution of files in a file 
system calls for a different sampling technique, which we
present in this paper.


Top-k Query Processing: Top-k query processing has 
been extensively studied over both databases (e.g., see a 
recent survey [27]) and file systems [3,7,26,32]. For file 
systems, a popular application is to locate the top-k most 
frequent (or space-consuming) files/blocks for redun- 
dancy detection and removal. For example, Lillibridge 
et al. [33] proposed the construction of an in-memory 
sparse index to compare an incoming block against a few 
(most similar) previously stored blocks for duplicate de- 
tections (which can be understood as a top-k query with 
a scoring function of similarity). Top-k query processing
has also been discussed in other index-building techniques,
e.g., in Spyglass [32] and SmartStore [26].


7 Discussion


At present, Glance takes several pre-defined parameters 
as the inputs and needs to complete the execution in 
whole. That is, Glance is not an any-time algorithm and 
cannot be stopped in the middle of the execution, because 
our current approach relies on a complete sample to re- 
duce query variance and achieve high accuracy. One lim- 
itation of this approach is that its runtime over an alien 
file system is unknown in advance, making it unsuitable 
for the applications with absolute time constraints. For 
example, a border patrol agent may need to count the
number of encrypted files on a traveler’s hard drive, in
order to determine whether the traveler could be transporting
sensitive documents across the border [13, 44].
In this case, the agent must make a fast decision, as the
amount of time each traveler can be detained is extremely
limited. We envision that in the future Glance
shall offer a time-out knob that a user can use to bound
the query time over a file system. This calls for new algorithms
that allow Glance to get smarter - to be predictive about
the run time and to self-adjust the work flow based on the
real-time requirements.

Glance currently employs a “static” strategy over file
systems and queries, i.e., it does not modify its tech- 
niques and traversals for a query. A dynamic approach 
is attractive because in that case Glance would be able to 
adjust the algorithms and parameters depending on the 
current query and file system. New sampling techniques, 
e.g., stratified and weighted sampling, shall be investi- 
gated to further improve query accuracy on large file sys- 
tems. The semantic knowledge of a file system can also 
help in this approach. For example, most images can 
be found in a special directory, e.g. “/User/Pictures/” in 
MacOS X, or “\Documents and Settings\User\My Doc- 
uments\My Pictures\” in Windows XP. 

Glance shall also leverage the results from the pre- 
vious queries to significantly expedite the future ones, 
which is beneficial in situations when the workload is a 
set of queries that are executed very infrequently. The 
basic idea is to store the previous estimations over parts 
(e.g., subtrees) of the file system, and utilize the history 
to limit the search space to the previously unexplored 
part of the file system, unless it determines that the his- 
tory is obsolete (e.g., according to a pre-defined valid- 
ity period). Note that the history shall be continuously 
updated to include newly discovered directories and to 
update the existing estimations. 


8 Conclusion 


In this paper we have initiated an investigation of just-in-time
analytics over a large-scale file system through
its tree- or DAG-like structure. We proposed a random
descent technique to produce unbiased estimations
for SUM and COUNT queries and accurate estimations
for other aggregate queries, and a pruning-based technique
for the approximate processing of top-k queries.
We proposed two improvements, high-level crawling and
breadth-first descent, and described a comprehensive set
of experiments which demonstrate the effectiveness of
our approach over real-world file systems.


Abstract


We present Anticipatory Memory Allocation (AMA), a 
new method to build kernel code that is robust to memory- 
allocation failures. AMA avoids the usual difficulties in 
handling allocation failures through a novel combination 
of static and dynamic techniques. Specifically, a devel- 
oper, with assistance from AMA static analysis tools, de- 
termines how much memory a particular call into a kernel 
subsystem will need, and then pre-allocates said amount 
immediately upon entry to the kernel; subsequent alloca- 
tion requests are serviced from the pre-allocated pool and 
thus guaranteed never to fail. We describe the static and 
run-time components of AMA, and then present a thor- 
ough evaluation of Linux ext2-mfr, a case study in which 
we transform the Linux ext2 file system into a memory- 
failure robust version of itself. Experiments reveal that 
ext2-mfr avoids memory-allocation failures successfully 
while incurring little space or time overhead. 


1 Introduction 


A great deal of recent activity in systems research has fo- 
cused on new techniques for finding bugs in large code 
bases [13, 16, 17,20, 24, 26,38]. Whether using static 
analysis [16,20], model checking [25,40], symbolic ex- 
ecution [10,39], machine learning [24], or other testing- 
based techniques [3,4,31], all seem to agree: there are 
hundreds of bugs in commonly-used systems. 

One important class of software defect is found in re- 
covery code, i.e., code that is run in reaction to failure. 
These failures, whether from hardware (e.g., a disk) or 
software (e.g., a memory allocation), tend to occur quite 
rarely in practice, but the correctness of the recovery code 
is critical. For example, Yang et al. found a large num- 
ber of bugs in file-system recovery code; when such bugs 
were triggered, the results were often catastrophic, result- 
ing in data corruption or unmountable file systems [40]. 
Recovery code has the worst possible property: it is rarely 
run, but absolutely must work correctly. 

Memory-allocation failure serves as an excellent and 
important example of the recovery-code phenomenon. 
Woven throughout a complex system such as Linux are 
memory allocations of various flavors (e.g., kmalloc,
kmem_cache_alloc, etc.) in conjunction with small
snippets of recovery code to handle those rare cases 
when a memory allocation fails. As previous work has 
shown [17,28, 40], and as we further demonstrate in this 
paper (§2), this recovery code does not work very well, 
often crashing the system or worse when run. 

Thus, in this paper, we take a different approach to solv- 
ing the problem presented by memory-allocation failures. 
We follow one simple mantra: the most robust recovery 
code is recovery code that never runs at all. 

Our approach is called Anticipatory Memory Allocation 
(AMA). The basic idea behind AMA is simple. First, us- 
ing both a static analysis tool plus domain knowledge, the 
developer determines a conservative estimate of the to- 
tal memory allocation demand of each call into the ker- 
nel subsystem of interest. Using this information, the de- 
veloper then augments their code to pre-allocate the req- 
uisite amount of memory at run-time, immediately upon 
entry into the kernel subsystem. The AMA run-time then 
transparently redirects existing memory-allocation calls to 
use memory from the pre-allocated chunk. Thus, when a 
memory allocation takes place deep in the heart of the ker- 
nel subsystem, it is guaranteed never to fail. 

With AMA, kernel code is written naturally, with mem- 
ory allocations inserted wherever the developer needs 
them to be; however, with AMA, the developer need not 
be concerned with downstream memory-allocation fail- 
ures and the scattered (and often buggy) recovery code 
that would otherwise be required. Further, by allocat- 
ing memory in one large chunk upon entry, failure of the 
anticipatory pre-allocation is straightforward to handle; a 
uniform failure-handling policy (such as retry with expo- 
nential backoff) can trivially be implemented. 
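
The idea can be modeled in a few lines. The sketch below is a user-space toy in Python, not kernel code and not the AMA run-time: `PreallocatedPool`, `reserve_with_backoff`, and the `try_reserve` callback are all hypothetical names, and the conservative size estimate is assumed to be supplied by the caller.

```python
import time


class PreallocatedPool:
    """Toy model of an anticipatory pool: reserve once, then never fail."""

    def __init__(self, reserved_bytes):
        self.remaining = reserved_bytes

    def alloc(self, size):
        # Inside the subsystem, requests are carved out of the reservation.
        if size > self.remaining:
            raise AssertionError("estimate was not conservative enough")
        self.remaining -= size
        return bytearray(size)


def reserve_with_backoff(try_reserve, nbytes, max_attempts=5,
                         base_delay=0.001, sleep=time.sleep):
    """Single failure-handling point: retry the one up-front reservation."""
    for attempt in range(max_attempts):
        if try_reserve(nbytes):
            return PreallocatedPool(nbytes)
        sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise MemoryError("could not pre-allocate %d bytes" % nbytes)
```

All failure handling is concentrated in `reserve_with_backoff`; code that later calls `pool.alloc()` needs no recovery path as long as the estimate was truly conservative.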

To demonstrate the benefits of AMA, we apply it to 
the Linux ext2 file system to build a memory-failure ro- 
bust version of ext2 called ext2-mfr. File systems are 
one of the most critical components of the kernel, as they 
store persistent state, and bugs within the file system can 
lead to serious problems [40]; hence, they serve as an 
excellent case study for AMA (although much of AMA 
is generic and could be applied elsewhere in the kernel). 
Through experiment, we show that ext2-mfr is robust to
memory-allocation failure, and runs without noticeable
performance or space overheads; key to the reduction in 
space overheads are two novel optimizations we intro- 
duce, cache peeking and page recycling. Further, very 
little code change is required, thus demonstrating the ease 
of transforming a significant subsystem. Overall, we find 
that AMA achieves its goals, and thus altogether avoids
one important class of recovery bug commonly found
in kernel code.

In our current prototype, the static analysis tool in 
AMA is semi-automated. AMA requires developer in- 
volvement at the last stage of the static analysis to com- 
pute the memory requirements for each call. More pro- 
gramming effort is required to fully automate the static 
analysis tool. Hence, in its current form, our AMA proto- 
type serves as a feasibility study of applying static analy- 
sis techniques inside operating systems to avoid a class of 
recovery code. 

The rest of this paper is structured as follows. We 
first present more background on Linux memory alloca- 
tion (§2), including a further study of how Linux file sys- 
tems react to memory failure. We then present the design 
and implementation of AMA (§3, §4, §5), and evaluate its
robustness and performance (§6). Finally, we discuss re- 
lated work (§7) and conclude (§8). 


2 Background 


Before delving into the depths of AMA, we provide some 
background on kernel memory allocation. We first de- 
scribe the many different ways in which memory is ex- 
plicitly allocated within the kernel. Then, through fault 
injection, we show that many problems still exist in han- 
dling memory-allocation failures. Our discussion re- 
volves around the Linux kernel (with a focus on file sys- 
tems), although in our belief the issues that arise here 
likely exist in other modern operating systems. 


2.1 Linux Allocators 

2.1.1 Memory Zones 

At the lowest level of memory allocation within Linux 
is a buddy-based allocator of physical pages [7], 
with low-level routines such as alloc_pages() and 
free_pages() called to request and return pages, re- 
spectively. These functions serve as the basis for the al- 
locators used for kernel data structures (described below), 
although they can be called directly if so desired. 


2.1.2 Kernel Allocators 

Most dynamic memory requests in the kernel use the 
Linux slab allocator, which is based on Bonwick’s orig- 
inal slab allocator for Solaris [6] (a newer SLUB alloca- 
tor provides the same interfaces but is internally simpler). 
One simply calls the generic memory allocation routines 
kmalloc() and kfree() to use these facilities.

          kmalloc  kmem_cache_alloc  vmalloc  mempool_create  alloc_pages
btrfs          93                 7        3               0            1
ext2            8                 1        0               0            0
ext3           12                 1        0               0            0
ext4           26                10        1               0            0
jfs            18                 1        2               1            0
reiser         17                 1        5               0            0
xfs            11                 1        0               1            1


Table 1: Usage of Different Allocators. The table shows the
number of different memory allocators used within Linux file systems.
Each column presents the number of times a particular routine is found
in each file system.


For objects that are particularly popular, specialized 
caches can be explicitly created. To create such a cache, 
one simply calls kmem_cache_create(), which (if 
successful) returns a reference to the newly-created object 
cache; subsequent calls to kmem_cache_alloc() are 
passed this reference and return memory for the specific 
object. Hundreds of these specialized allocation caches 
exist in a typical system (see /proc/slabinfo); a 
common usage for a file system, for example, is an inode 
cache. 

Beyond these commonly-used routines, there are a few 
other ways to request memory in Linux. A memory pool 
interface allows one to reserve memory for use in emer- 
gency situations. Finally, the virtual malloc interface re- 
quests in-kernel pages that are virtually (but not necessar- 
ily physically) contiguous. 

To demonstrate the diversity of allocator usage, we 
present a study of the popularity of these interfaces within 
a range of Linux file systems. We study file systems as 
they are an important and complex kernel subsystem, and 
one in which memory-allocation failure can lead to serious
problems [40]. Table 1 presents our results. As
one can see, although the generic interface kmalloc()
is most popular, the other allocation routines are used as 
well. For kernel code to be robust, it must handle failures 
from all of these allocation routines. 


2.2 Failure Modes 


When calling into an allocator, flags determine the exact
behavior of the allocator, particularly in response
to failure. Of greatest import to us is the use of the
__GFP_NOFAIL flag, which a developer can use when
they know their code cannot handle an allocation failure;
using the flag is the only way to guarantee that an allocator
will either return successfully or not return at all (i.e.,
keep trying forever). However, this flag is rarely used. As
lead Linux kernel developer Andrew Morton said [27]:
“__GFP_NOFAIL should only be used when we have no
way of recovering from failure. ... Actually, nothing in
the kernel should be using __GFP_NOFAIL. It is there as a
marker which says ‘we really shouldn’t be doing this but
we don’t know how to fix it’.” In all other uses of kernel
allocators, failure is thus a distinct possibility.


               Process State     File-System State
               Error   Abort     Unusable   Inconsistent
btrfs_0            0       0            0              0
btrfs_10           0      14           15              0
btrfs_50           0      15           15              0
ext2_0             0       0            0              0
ext2_10           10       5            5              0
ext2_50           10       5            5              0
ext3_0             0       0            0              0
ext3_10           10       5            5              4
ext3_50           10       5            5              5
ext4_0             0       0            0              0
ext4_10           10       5            5              5
ext4_50           10       5            5              5
jfs_0              0       0            0              0
jfs_10            15       0            2              5
jfs_50            15       0            5              5
reiserfs_0         0       0            0              0
reiserfs_10       10       4            4              0
reiserfs_50       10       5            5              0
xfs_0              0       0            0              0
xfs_10            13       1            0              3
xfs_50            10       5            0              5


Table 2: Fault Injection Results. The table shows the reaction
of the Linux file systems to memory-allocation failures as the probability
of a failure increases. We randomly inject faults into the three
most-used allocation calls: kmalloc(), kmem_cache_alloc(),
and __alloc_pages(). For each file system and each probability
(shown as a subscript, in percent), we run a micro benchmark 15 times and report
the number of runs in which certain failures happen in each column. We
categorize all failures into process state and file-system state, in which
'Error' means that file system operations fail (gracefully), 'Abort' indicates
that the process was terminated abnormally, 'Unusable' means the
file system is no longer accessible, and 'Inconsistent' means file system
metadata has been corrupted and data may have been lost. Ideally, we
expect the file systems to gracefully handle the error (i.e., return an error)
or retry the failed allocation request. Aborting a process, inconsistent
file-system state, and an unusable file system are unacceptable reactions to
a memory-allocation failure.


2.3 Bugs in Memory Allocation

Earlier work has repeatedly found that memory-allocation
failure is often mishandled [16, 40]. In Yang et al.’s
model-checking work, one key to finding bugs is to follow
the code paths where memory allocation has failed [40].

We now perform a brief study of memory-allocation
failure handling within Linux file systems. We use fault
injection to fail calls to the various memory allocators and
determine how the code reacts as the number of such failures
increases. Our injection framework picks a certain
allocation call (e.g., kmalloc()) within the code and
fails it probabilistically; we then vary the probability and
observe how the kernel reacts as an increasing percentage
of memory-allocation calls fail. Table 2 presents our
results, which sum the failures seen in 15 runs per file
system, while increasing the probability of an allocation
request failing from 0% to 50% of the time.


empty_dir()                          [file: namei.c]
  if (... || !(bh = ext4_bread(..., &err)))
    return 1;  // XXX: should have returned 0

ext4_rmdir()                         [file: namei.c]
  retval = -ENOTEMPTY;
  if (!empty_dir(inode))
    goto end_rmdir;
  retval = ...
  if (retval)
    goto end_rmdir;


Figure 1: Improper Failure Propagation. The code shown
in the figure is from the ext4 file system, and shows a case where a failed
low-level allocation (in ext4_bread()) is not properly handled, which
eventually leads to an inconsistent file system.


The table reports what happens as the probability of al- 
location failure occurring increases, from 0% (base case), 
to 10% and then 50% of calls. We report the outcomes 
in two categories: process state and file-system state. The 
process state results are further divided into two groups: 
the number of times (in 15 runs) that a running process 
received an error (such as ENOMEM), and the number 
of times that a process was terminated abnormally (i.e.,
killed). The file system results are split into two categories
as well: a count of the number of times that the file system
became unusable (i.e., further use of the file system
was not possible after the trial), and the number of times
the file system became inconsistent as a result, possibly
losing user data.
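
The experimental procedure can be illustrated with a small user-space model. This is only a sketch of probabilistic fault injection in general, not the injection framework used in the study: `FaultInjector`, `run_workload`, and the toy workload are hypothetical, and a failed allocation is modeled as a `None` return.

```python
import random


class FaultInjector:
    """Wraps an allocation routine and fails it with a given probability."""

    def __init__(self, alloc, fail_prob, seed=0):
        self.alloc = alloc
        self.fail_prob = fail_prob
        self.rng = random.Random(seed)
        self.injected = 0

    def __call__(self, size):
        if self.rng.random() < self.fail_prob:
            self.injected += 1
            return None  # models an allocator returning NULL
        return self.alloc(size)


def run_workload(alloc, requests):
    """Toy workload: returns 'ok', or 'error' on the first failed allocation."""
    for size in requests:
        if alloc(size) is None:
            return "error"  # graceful failure, e.g. ENOMEM to the caller
    return "ok"


# Tally outcomes over 15 runs at each failure probability, as in the study.
for prob in (0.0, 0.1, 0.5):
    outcomes = [run_workload(FaultInjector(bytearray, prob, seed=run),
                             [64, 128, 256])
                for run in range(15)]
    print(prob, outcomes.count("error"), "of 15 runs saw an error")
```

A real kernel workload can fail in worse ways than this model (abort, unusable or inconsistent file system), which is exactly what the four columns of Table 2 distinguish.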


From the table, we can make the following observa- 
tions. First, we can see that even a simple, well-tested, 
and slowly-evolving file system such as Linux ext2 still 
does not handle memory-allocation failures very well; we 
take this as evidence that doing so is challenging. Second, 
we observe that all file systems have difficulty handling 
memory-allocation failure, often resulting in an unusable 
or inconsistent file system. 


An example of how a file-system inconsistency can
arise is found in Figure 1. In this example, in trying
to remove a directory (in ext4_rmdir()), the
routine first checks if the directory is empty by calling
empty_dir(). This routine, in turn, calls
ext4_bread() to read the directory data. Unfortunately,
due to our fault injection, ext4_bread() tries
to allocate memory but fails to do so, and thus the call to
ext4_bread() returns an error (correctly). The routine
empty_dir() incorrectly propagates this error, simply
returning a 1 and thus accidentally indicating that the
directory is empty and can be deleted. Deleting a non-empty
directory not only leads to a hard-to-detect file-system
inconsistency (despite the presence of journaling), but also
could render inaccessible a large portion of the directory
tree.
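
The pattern can be reduced to a few lines. The following is a simplified user-space sketch, not the ext4 code: struct dir, read_dir_block(), and both empty_dir variants are hypothetical stand-ins. It contrasts reporting "empty" (1) on a failed read with the conservative answer (0) that the comment in Figure 1 calls for.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: a "directory" and a block read that can fail. */
struct dir { int nr_entries; int read_fails; };

/* Returns NULL on a simulated allocation failure during the read. */
static const struct dir *read_dir_block(const struct dir *d)
{
    assert(d != NULL);
    return d->read_fails ? NULL : d;
}

/* Buggy pattern: a failed read is reported as "directory is empty" (1). */
int empty_dir_buggy(const struct dir *d)
{
    const struct dir *blk = read_dir_block(d);
    if (blk == NULL)
        return 1;               /* wrong: the caller will delete the directory */
    return blk->nr_entries == 0;
}

/* Conservative pattern: on a failed read, report "not empty" (0). */
int empty_dir_safe(const struct dir *d)
{
    const struct dir *blk = read_dir_block(d);
    if (blk == NULL)
        return 0;               /* emptiness cannot be proven, so refuse */
    return blk->nr_entries == 0;
}
```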

Finally, a closer look at the code of some of these file 
systems reveals a third interesting fact: in a file system 
under active development (such as btrfs), there are many 
places within the code where memory-allocation failure 
is never checked for; our inspection thus far has yielded 
over 20 places within btrfs such as this. Such trivial mis- 
handling is rarer inside more mature file systems. 

Overall, our results hint at a broader problem, which 
matches intuition: developers write code as if memory 
allocation will never fail; only later do they (possibly) 
go through the code and attempt to “harden” it to handle 
the types of failures that might arise. Proper handling of 
such errors, as seen in the ext4 example, is a formidable 
task, and as a result, such hardening sometimes remains 
“softer” than desired. 


2.3.1 Summary 


Kernel memory allocation is complex, and handling fail- 
ures still proves challenging even for code that is relatively 
mature and generally stable. We believe these problems 
are fundamental given the way current systems are de- 
signed; specifically, to handle failure correctly, a deep re- 
covery must take place, where far downstream in the call 
path, one must either handle the failure, or propagate the 
error up to the appropriate error-handling location while 
concurrently making sure to unwind all state changes that 
have taken place on the way down the path. Earlier work 
has shown that the simple act of propagating an error cor- 
rectly in a complex file system is challenging [19]; doing 
so and correctly reverting all other state changes presents 
further challenges. Although deep recovery is possible, 
we believe it is usually quite hard, and thus error-prone. 
More sophisticated bug-finding tools could be built, and 
further bugs unveiled; however, to truly solve the problem, 
an alternate approach to deep recovery is likely required. 


3 Anticipatory Memory Allocation: 
An Overview 


We now present an overview of Anticipatory Memory Allocation
(AMA), a novel approach to solve the memory-allocation
failure-handling problem. The basic idea is
simple: first, we analyze the code paths of a kernel subsystem
to determine what their memory requirements are.
Second, we augment the code with a call to pre-allocate
the necessary amounts. Third, we transparently redirect
allocation requests during run-time to use the
pre-allocated chunks of memory.

void f2() {
    void *p = malloc(100);
    f3();
}
void f3() {
    void *q = malloc(25);
}
int f1() {
    // AMA: Pre-allocate 100- and 25-byte chunks
    f2();
    // AMA: Free any unused chunks
}

Figure 2: Simple AMA Example. The code presents a simple
example of how AMA is used. In the unmodified case, routine f1()
calls f2(), which calls f3(), each of which allocates some memory
(and perhaps incorrectly handles its failure). With AMA, f1()
pre-allocates the full amount needed; subsequent calls to allocate memory
are transparently redirected to use the pre-allocated chunks instead of
calling into the real allocators, and any remaining memory is freed.

Figure 2 shows a simple example of the transformation.
In the figure, a simple entry-point routine f1() calls
one other downstream routine, f2(), which in turn calls
f3(). Each of these routines allocates some memory during
its normal execution, in this case 100 bytes by f2()
and 25 bytes by f3().


With AMA, we analyze the code paths to discover the
worst-case allocation possible; in this example, the analysis
would be simple, and the result is that two memory
chunks, of size 100 and 25 bytes, are required. Then, before
calling into f2(), one should call into the anticipatory
memory allocator to pre-allocate chunks of 100 and
25 bytes. The modified run-time then redirects all downstream
allocation requests to use this pre-allocated pool.
Thus the calls to allocate 100 and 25 bytes in f2() and
f3() (respectively) will use memory already allocated by
AMA, and are guaranteed not to fail.
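
To make the flow concrete, here is a minimal user-space sketch of the same f1()/f2()/f3() example. It is not the kernel implementation: the names ama_prealloc(), ama_alloc(), and ama_cleanup(), the fixed-size pool, and the exact-size matching rule are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define AMA_MAX_CHUNKS 8

/* Hypothetical per-call pool of pre-allocated chunks. */
struct ama_chunk { size_t size; void *mem; int used; };
static struct ama_chunk pool[AMA_MAX_CHUNKS];
static int pool_len = 0;
static int ama_active = 0;      /* plays the role of an "anticipatory" flag */

/* Pre-allocate one chunk per requested size; 0 on success, -1 on failure. */
int ama_prealloc(const size_t *sizes, int n)
{
    assert(n <= AMA_MAX_CHUNKS && pool_len == 0);
    for (int i = 0; i < n; i++) {
        void *m = malloc(sizes[i]);
        if (m == NULL) {                /* the only place failure surfaces */
            while (pool_len > 0)
                free(pool[--pool_len].mem);
            return -1;
        }
        pool[pool_len++] = (struct ama_chunk){ sizes[i], m, 0 };
    }
    ama_active = 1;
    return 0;
}

/* Redirected allocator: serve from the pool while AMA is active. */
void *ama_alloc(size_t size)
{
    if (!ama_active)
        return malloc(size);            /* normal path; may fail */
    for (int i = 0; i < pool_len; i++)
        if (!pool[i].used && pool[i].size == size) {
            pool[i].used = 1;
            return pool[i].mem;
        }
    return NULL;        /* would mean the analysis under-estimated */
}

/* Free chunks that were never handed out; returns how many were freed. */
int ama_cleanup(void)
{
    int freed = 0;
    for (int i = 0; i < pool_len; i++)
        if (!pool[i].used) { free(pool[i].mem); freed++; }
    pool_len = 0;
    ama_active = 0;
    return freed;
}

static void *p100, *q25;
static void f3(void) { q25 = ama_alloc(25); }
static void f2(void) { p100 = ama_alloc(100); f3(); }

int f1(void)
{
    const size_t need[] = { 100, 25 };
    if (ama_prealloc(need, 2) != 0)
        return -1;                      /* shallow recovery point */
    f2();
    return ama_cleanup();               /* count of unused chunks freed */
}
```

In this sketch f2() and f3() never see a failure; the only error check sits in f1(), before any downstream state has changed.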

The advantages of this approach are many. First,
memory-allocation failures never happen downstream,
and thus there is no need to handle said failures; the complex
unwinding of kernel state and error propagation are
thus avoided entirely. Second, because allocation failure
can happen in only one place in the code (at the top),
it is easy to provide a unified handling mechanism; for example,
if the call to pre-allocate memory fails, the developer
could decide to immediately return a failure, retry,
or perhaps implement a more sophisticated exponential
backoff-and-retry approach, all excellent examples of the
shallow recovery AMA enables. Third, very little code
change is required; except for the calls to pre-allocate and
perhaps free unused memory, the bulk of the code remains
unmodified, as the run-time transparently redirects downstream
allocation requests to use the pre-allocated pool.

void
ext2_init_block_alloc_info(struct inode *inode)
{
    struct ext2_inode_info *ei = EXT2_I(inode);
    struct ext2_block_alloc_info *block_i = ei->i_block_alloc_info;
    block_i = kmalloc(sizeof(*block_i), GFP_NOFS);
    ...
}

Figure 3: A Simple Call.

Unfortunately, code in real systems is not as simple as 
that found in the figure, and indeed, the problem of deter- 
mining how much memory needs to be allocated given an 
entry point into a complex code base is generally unde- 
cidable. Thus, the bulk of our challenge is transforming 
the code and gaining certainty that we have done so cor- 
rectly and efficiently. To gain a better understanding of 
the problem, we must choose a subsystem to focus upon, 
and transform it to use AMA. 


3.1 A Case Study: Linux ext2-mfr 


The case study we use is the Linux ext2 file system. 
Although simpler than its modern journaling cousins, 
ext2 is a real file system and certainly has enough com- 
plex memory-allocation behavior (as described below) to 
demonstrate the intricacies of developing AMA for a real 
kernel subsystem. 

We describe our effort to transform the Linux ext2 file 
system into a memory-robust version of itself, which we 
call Linux ext2-mfr (i.e., a version of ext2 that is Memory- 
Failure Robust). In our current implementation, the trans- 
formation requires some human effort and is aided by a 
static analysis tool that we have developed. The process 
could be further automated, thus easing the development 
of other memory-robust file systems; we leave such efforts 
to future work. 

We now highlight the various types of allocation re- 
quests that are made, from simpler to more complex. By 
doing so, we are showing what work needs to be done 
to be able to correctly pre-allocate memory before calling 
into ext2 routines, and thus shedding light on the types 
of difficulties we encountered during the transformation 
process. 


3.1.1 Simple Calls

Most of the memory-allocation calls made by the kernel
are of a fixed size. File-system objects such as
dentry, file, inode, and page have pre-determined sizes. For
example, file systems often maintain a cache of inode objects,
and thus must have memory allocated for them before
they are read from disk. Figure 3 shows one example of
such a call from ext2.

struct dentry *d_alloc(..., struct qstr *name) {
    if (name->len > DNAME_INLINE_LEN-1) {
        dname = kmalloc(name->len + 1, GFP_KERNEL);
        if (!dname)
            return NULL;
    }
}

Figure 4: A Parameterized and Conditional Call.

ext2_find_entry(struct inode *dir, ...)
{
    unsigned long npages = dir_pages(dir);
    unsigned long n = 0;
    do {
        page = ext2_get_page(dir, n, ...); // allocate a page
        if (ext2_match_entry(...))
            goto found;
        n++;
    } while (n != npages); // worst case: n = npages
found:
    return entry;
}

Figure 5: Loop Calls.


3.1.2 Parameterized and Conditional Calls 

Some allocated objects have variable lengths (e.g., a file
name, extended attributes, and so forth) and the exact size
of the allocation is determined at run-time; sometimes
allocations are not performed due to conditionals.
Figure 4 shows how ext2 allocates memory for a directory
entry, which uses a length field (plus one for the end-of-string
marker) to request the proper amount of memory.
This allocation is only performed if the name is too long
and requires more space to hold it.


3.1.3 Loops 

In many cases file systems allocate objects inside a loop
or inside nested loops. In ext2, the upper bound of the
loop execution is determined by the object passed to the
individual calls. For example, allocating pages to search
for directory entries is done inside a loop. Another good
example is searching for a free block within the block
bitmaps of the file system. Figure 5 shows the page
allocation code during directory lookups in ext2.
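
The worst-case bounds implied by Figures 4 and 5 can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the inline-name limit of 36 is an assumed value, and charging one page plus one buffer head per directory page is an assumption borrowed from the per-page terms of Table 3 (4096 and 52 bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative constants; the inline-name limit is an assumed value. */
#define DNAME_INLINE_LEN_SKETCH 36
#define PAGE_SIZE_BYTES   4096
#define BUFFER_HEAD_BYTES 52

/* Conditional, parameterized call: bytes needed for a long name, else 0. */
size_t dname_prealloc_bytes(size_t name_len)
{
    if (name_len > DNAME_INLINE_LEN_SKETCH - 1)
        return name_len + 1;        /* +1 for the end-of-string marker */
    return 0;                       /* name fits inline; no allocation */
}

/* Loop: the worst case visits every page of the directory. */
size_t dir_search_prealloc_bytes(size_t npages)
{
    const size_t per_page = PAGE_SIZE_BYTES + BUFFER_HEAD_BYTES;
    assert(npages <= (size_t)-1 / per_page);    /* guard against overflow */
    return npages * per_page;
}
```

For a three-page directory the loop bound is 3 x 4148 = 12444 bytes, even though a typical lookup may stop at the first page.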


3.1.4 Function Calls 

Of course, a file system is spread across many functions, 
and hence any attempt to understand the total memory 
allocation of a call graph given an entry point must be 
able to follow all such paths, sometimes into other ma- 
jor kernel subsystems. For example, one memory allo- 
cation request in ext2 is invoked 21 calls deep; this ex- 
ample path starts at sys_open, traverses through some 
link-traversal and lookup code, and ends with a call to 
kmem_cache_alloc. 

static void
ext2_free_branches(struct inode *inode,
                   ..., int depth) {
    if (depth--) {
        // allocate a page and buffer head
        bh = sb_bread(inode->i_sb, ..);
        ext2_free_branches(inode,
                           (__le32*) bh->b_data,
                           (__le32*) bh->b_data +
                           addr_per_block,
                           depth);
    } else
        ext2_free_data(inode, ...);
}

Figure 6: Recursion.


3.1.5 Recursions 

A final example of an in-kernel memory allocation is one 
that is performed within a recursive call. Some portions of 
file systems are naturally recursive (e.g., pathname traver- 
sal), and thus perhaps it is no surprise that recursion is 
commonplace. Figure 6 shows the block-freeing code that 
is called when a file is truncated or removed in ext2; in the 
example, ext2_free_branches calls itself to recurse 
down indirect-block chains and free blocks as need be. 


3.2 Summary

To be able to pre-allocate enough memory for a call, 
one must handle parameterized calls, conditionals, loops, 
function calls, and recursion. If file systems only con- 
tained simple allocations and minimal amounts of code, 
pre-allocation would be rather straightforward. The rele- 
vant portion of the call graph for ext2 (and all related com- 
ponents of the kernel) contains nearly 2000 nodes (one per 
relevant function) and roughly 7000 edges (calls between 
functions) representing roughly 180,000 lines of kernel 
source code. Even for a relatively-simple file system such 
as ext2, the task of manually computing the pre-allocation 
amount would be daunting, without automated assistance. 


4 The Static Transformation:
From ext2 to ext2-mfr


We now present the static-analysis portion of AMA, in 
which we develop a tool, the AMAlyzer, to help decide 
how much memory to pre-allocate at each entry point 
into the kernel subsystem that is being transformed (in 
this case, Linux ext2). The AMAlyzer takes in the entire 
relevant call graph and produces a skeletal version, from 
which the developer can derive the proper pre-allocation 
amounts. After describing the tool, we also present two 
novel optimizations we employ, cache peeking and page 
recycling, to reduce memory demands. We end the section 
with a discussion of the limits of our current approach. 

We build the AMAlyzer on top of CIL [29], a tool
which allows us to readily analyze kernel source code.
CIL does not resolve function pointers automatically,
which we require for our complete call graph, and hence
we perform a small amount of extra work to ensure we
cover all calls made in the context of the file system;
because of the limited and stylized use of function pointers
within the kernel, this process is straightforward. The
AMAlyzer in its current form consists of a few thousand
lines of OCaml code.


4.1 The AMAlyzer 

We now describe the AMAlyzer in more detail, which 
consists of two phases. In the first phase, the tool searches 
through the entire subsystem to construct the allocation- 
relevant call graph, i.e., the complete set of downstream 
functions that contain kernel memory-allocation requests. 
In the second phase, a more complex analysis determines 
which variables and state are relevant to allocation calls, 
and prunes away other irrelevant code. The result is a 
skeletal form of the subsystem in question, from which 
the pre-allocation amounts are readily derived. 


4.1.1 Phase 1: Allocation-Relevant Call Graph 

The first step of our analysis prunes the entire call graph,
which, as we have seen, is quite large, and generates what
we refer to as the allocation-relevant call graph (ARCG).
The ARCG contains only nodes and edges in which a
memory allocation occurs, either within a node of the
graph or somewhere downstream of it.

We perform a Depth First Search (DFS) on the call
graph to generate the ARCG. An additional attribute, namely
calls_memory_allocation, is added to each node (i.e.,
function) in the call graph to speed up the ARCG
generation. The calls_memory_allocation attribute is set
on two occasions: first, when a memory allocation
routine is encountered during the DFS; second, when
at least one of the node's children has its
calls_memory_allocation attribute set.

At the end of the DFS, the functions that do not have the
calls_memory_allocation attribute set are safely deleted
from the call graph. The remaining nodes in the call graph
constitute the ARCG.
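
The marking step can be sketched in a few lines of C. This is not the AMAlyzer itself, which is written in OCaml on top of CIL; struct func_node, mark_relevant(), build_arcg(), and arcg_size() are hypothetical names, and repeating the DFS until the marks stop changing is an added assumption so that functions on call cycles are not missed.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FUNCS 16
#define MAX_CALLS 4

/* Hypothetical call-graph node: one per function. */
struct func_node {
    bool is_allocator;              /* a kmalloc-like routine */
    int  ncallees;
    int  callees[MAX_CALLS];        /* indices into the graph array */
    bool visited;
    bool calls_memory_allocation;
};

/* DFS: a node is relevant if it is an allocator or any callee is relevant. */
bool mark_relevant(struct func_node *g, int f)
{
    assert(f >= 0 && f < MAX_FUNCS);
    if (g[f].visited)
        return g[f].calls_memory_allocation;
    g[f].visited = true;            /* also stops endless recursion on cycles */
    if (g[f].is_allocator)
        g[f].calls_memory_allocation = true;
    for (int i = 0; i < g[f].ncallees; i++)
        if (mark_relevant(g, g[f].callees[i]))
            g[f].calls_memory_allocation = true;
    return g[f].calls_memory_allocation;
}

/* Count the nodes that survive pruning, i.e., the size of the ARCG. */
int arcg_size(const struct func_node *g, int n)
{
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (g[i].calls_memory_allocation)
            kept++;
    return kept;
}

/* Repeat the DFS from the entry point until no new node gets marked. */
void build_arcg(struct func_node *g, int n, int root)
{
    int before, after = 0;
    do {
        before = after;
        for (int i = 0; i < n; i++)
            g[i].visited = false;
        mark_relevant(g, root);
        after = arcg_size(g, n);
    } while (after != before);
}
```

Nodes whose attribute is still false after build_arcg() returns are the ones that would be deleted from the call graph.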


4.1.2 Phase 2: Loops and Recursion 

At this point, the tool has reduced the number of functions
that must be examined. In this part of the analysis, we add
logic to handle loops and recursions, and where possible,
to help identify their termination conditions. The AMAlyzer
searches for all for, while, and goto-based
loops, and walks through each function within such a loop
to find either direct calls to kernel memory allocators or
indirect calls through other routines. To identify goto-based
loops, AMA uses the line numbers of the labels that
the goto statements point to. To identify both recursions
and function-call based loops, AMA performs a DFS on
the ARCG and for every function encountered during the
search, it checks if the function has been explored before.
Once these loops are identified, the tool searches for and
outputs the expressions that affect termination.

Entry point   | Pre-allocation required
truncate()    | (Worst(Bitmap) + Worst(Indirect)) x (PageSize + BufferHead)
lookup()      | (1 + Size(ParentDir)) x (PageSize + BufferHead) + Inode + Dentry + NameLength + NamesCache
lookup_hash() | (1 + Size(ParentDir)) x (PageSize + BufferHead) + Inode + Dentry + NameLength + Filp
sys_open()    | lookup() + lookup_hash() + (4 + Depth(Inode) + Worst(Bitmap)) x PageSize + (5 + Depth(Inode) + Worst(Bitmap)) x BufferHead + Inode + truncate()
sys_read()    | (count + ReadAhead + Worst(Bitmap) + Worst(Indirect)) x (PageSize + BufferHead)
sys_write()   | (count + Worst(Bitmap)) x (PageSize + BufferHead) + sizeof(ext2_block_alloc_info)
mkdir()       | lookup() + lookup_hash() + (Depth(ParentInode) + 4) x PageSize + (Depth(Inode) + 8) x BufferHead
unlink()      | lookup() + lookup_hash() + (1 + Depth(Inode)) x (PageSize + BufferHead)
rmdir()       | lookup() + lookup_hash() + (3 + Depth(Inode)) x (PageSize + BufferHead)
access()      | lookup() + NamesCache
chdir()       | lookup() + NamesCache
chroot()      | lookup() + NamesCache
statfs()      | lookup() + NamesCache

Table 3: Pre-Allocation Requirements for ext2-mfr. The table shows the worst-case memory requirements of the various system
calls in terms of the kmem_cache, kmalloc, and page allocations. The following types of kmem_cache are used: NamesCache (4096 bytes),
BufferHead (52 bytes), Inode (476 bytes), Filp (128 bytes), and Dentry (132 bytes). The PageSize is constant at 4096 bytes. The other
terms used above include: count: the number of blocks read/written, ReadAhead: the number of read-ahead blocks, Worst(Bitmap):
the number of bitmap blocks that need to be read, Worst(Indirect): the number of indirect blocks to be read for that particular block,
Depth(inode): the maximum number of indirect blocks to be read for that particular inode, and Size(inode): the number of pages in the
inode.


4.1.3 Phase 3: Slicing and Backtracking 
The goal of this next step is to perform a bottom-up crawl 
of the graph, and produce a minimized call graph with 
only the memory-relevant code left therein. We use a form 
of backward slicing [37] to achieve this end. 

In our current prototype, the AMAlyzer only performs
a bottom-up crawl until the beginning of each function. In
other words, the slicing is done at the function level and
developer involvement is required to perform backtracking.
To backtrack until the beginning of a system call,
the developer has to manually use the output of slicing
for each function (including the dependent input variables
that affect the allocation size/count) and invoke the slicing
routine on its caller functions. The caller functions
are identified using the ARCG.


4.2 AMAlyzer Summary 


As we mentioned above, the final output is a skeletal
graph which can be used by the developer to arrive at
the final pre-allocations with the help of slicing support
in the AMAlyzer. For ext2-mfr, the reduction in code is
dramatic: from nearly 200,000 lines of code across 2000
functions (7000 function calls) down to fewer than 9,000
lines across 300 functions (400 function calls), with all
relevant variables highlighted. Arriving at the final
pre-allocation amounts then becomes a straightforward
process.

Table 3 summarizes the results of our efforts. In the 
table, we present the parameterized memory amounts that 
must be pre-allocated for the 13 most-relevant entry points 
into the file system. 
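
As a worked example of how one such parameterized amount turns into a byte count, the sketch below evaluates the sys_read() row of Table 3, (count + ReadAhead + Worst(Bitmap) + Worst(Indirect)) x (PageSize + BufferHead), with the 4096-byte page and 52-byte buffer head given in the table caption. The function name and the sample argument values are hypothetical.

```c
#include <assert.h>

enum { PAGE_SIZE_BYTES = 4096, BUFFER_HEAD_BYTES = 52 };   /* from Table 3 */

/* sys_read() row of Table 3; every argument is a number of blocks. */
unsigned long sys_read_prealloc_bytes(unsigned long count,
                                      unsigned long read_ahead,
                                      unsigned long worst_bitmap,
                                      unsigned long worst_indirect)
{
    unsigned long blocks = count + read_ahead + worst_bitmap + worst_indirect;
    assert(blocks >= count);            /* crude overflow guard */
    return blocks * (PAGE_SIZE_BYTES + BUFFER_HEAD_BYTES);
}
```

For a one-block read with no read-ahead, one bitmap block, and three indirect blocks, the result is 5 x 4148 = 20740 bytes.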


4.3 Optimizations 

As we transformed ext2 into ext2-mfr, we noticed a num- 
ber of opportunities for optimization, in which we could 
reduce the amount of memory pre-allocated along some 
paths. We now describe two novel optimizations. 


4.3.1 Cache Peeking 

The first optimization, cache peeking, can greatly reduce 
the amount of pre-allocated memory. An example is 
found in code paths that access a file block (such as a 
sys_read()). To access a file block in a large file, it 
is possible that a triple-indirect, double-indirect, and in- 
direct block, inode, and other blocks may need to be ac- 
cessed to find the address of the desired block and read it 
from disk. 

With repeated access to a file, such blocks are likely to 
be in the page cache. However, the pre-allocation code 
must account for the worst case, and thus in the normal 
case must pre-allocate memory to potentially read those 
blocks. This pre-allocation is often a waste, as the blocks 
will be allocated, remain unused during the call, and then 
finally be freed by AMA. 

With cache peeking, the pre-allocation code performs a
small amount of extra work to determine if the requisite
pages are already in cache. If so, it pins them there and
avoids the pre-allocation altogether; upon completion, the
pages are unpinned.

The pin/unpin is required for this optimization to be 
safe. Without this step, it would be possible that a page 
gets evicted from the cache after the pre-allocation phase 
but before the use of the page, which would lead to an 
unexpected memory allocation request downstream. In 
this case, if the request then failed, AMA would not have 
served its function in ensuring that no downstream failures 
occur. 

Cache peeking works well in many instances as the 
cached data is accessible at the beginning of a system call 
and does not require any new memory allocations. Even 
if cache peeking requires additional memory, the memory 
allocation calls needed for cache peeking can be easily 
performed as part of the pre-allocation phase. 
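
The peek, pin, and unpin sequence can be sketched with a toy cache model. Everything here is hypothetical: the array-backed cache, the pin counts, and the names peek_and_pin(), try_evict(), and unpin_all() stand in for the real page cache and are not its interface.

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 8

/* Hypothetical page-cache model: one slot per block number. */
struct cache_slot { bool present; int pin_count; };
static struct cache_slot cache[NBLOCKS];

/* Peek at each needed block: pin cached ones, count the rest.
 * Returns how many pages must still be pre-allocated. */
int peek_and_pin(const int *blocks, int n, bool *pinned)
{
    int to_prealloc = 0;
    for (int i = 0; i < n; i++) {
        assert(blocks[i] >= 0 && blocks[i] < NBLOCKS);
        struct cache_slot *s = &cache[blocks[i]];
        pinned[i] = s->present;
        if (s->present)
            s->pin_count++;         /* pinned pages must not be evicted */
        else
            to_prealloc++;          /* worst case: the block must be read */
    }
    return to_prealloc;
}

/* Eviction is refused while a page is pinned. */
bool try_evict(int block)
{
    if (cache[block].pin_count > 0)
        return false;
    cache[block].present = false;
    return true;
}

/* Drop the pins taken by peek_and_pin() once the call completes. */
void unpin_all(const int *blocks, int n, const bool *pinned)
{
    for (int i = 0; i < n; i++)
        if (pinned[i])
            cache[blocks[i]].pin_count--;
}
```

The refusal in try_evict() is the step that makes the optimization safe: between the peek and the use, a pinned page cannot disappear and force an unplanned allocation.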


4.3.2 Page Recycling 

A second optimization we came upon was the notion of 
page recycling. The idea for the optimization arose when 
we discovered that ext2 often uses far more pages than 
needed for certain tasks (such as file/directory truncates, 
searches on free/allocated entries inside block bitmaps 
and large directories). 

For example, consider truncate. In order to truncate a 
file, one must read every indirect block (and double in- 
direct block, and so forth) into memory to know which 
blocks to free. In ext2, each indirect block is read into 
memory and given its own page; the page holding an in- 
direct block is quickly discarded, after ext2 has freed the 
blocks pointed to by that indirect block. 

To reduce this cost, we implement page recycling. With
this approach, the pre-allocation phase allocates the minimal
number of pages that need to be in memory during the
operation. For a truncate, this number is proportional to
the depth of the indirect-block tree, instead of the size of
the entire tree. Instead of allocating thousands of blocks
to truncate a file, we only allocate a few (for the
triple-indirect, a double-indirect, and an indirect block). When
the code has finished freeing the current indirect block,
we recycle that page for the next indirect block instead
of adding the page back to the LRU page cache, and so
forth. In this manner, substantial savings in memory are
made possible.
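
The difference between the two page counts can be seen in a small model of the truncate walk. The sketch assumes a hypothetical indirect-block tree with uniform fan-out; struct walk_stats, free_branches(), and truncate_walk() are illustrative names, not ext2 code.

```c
#include <assert.h>

/* Hypothetical model: an indirect-block tree with uniform fan-out.
 * depth 1 = single indirect, 2 = double indirect, 3 = triple indirect. */
struct walk_stats {
    unsigned long blocks_visited;   /* indirect blocks read during truncate */
    unsigned long pages_in_use;     /* pages currently held */
    unsigned long peak_pages;       /* most pages held at any one time */
};

static void free_branches(int depth, int fanout, struct walk_stats *st)
{
    if (depth == 0)
        return;                     /* data block: nothing to read */
    st->blocks_visited++;           /* read this indirect block into a page */
    st->pages_in_use++;
    if (st->pages_in_use > st->peak_pages)
        st->peak_pages = st->pages_in_use;
    for (int i = 0; i < fanout; i++)
        free_branches(depth - 1, fanout, st);
    st->pages_in_use--;             /* page is recycled for the next block */
}

struct walk_stats truncate_walk(int depth, int fanout)
{
    struct walk_stats st = { 0, 0, 0 };
    assert(depth >= 0 && fanout > 0);
    free_branches(depth, fanout, &st);
    return st;
}
```

Here blocks_visited is what a pre-allocation without recycling would have to cover (one page per indirect block), while peak_pages is what recycling needs; with depth 3 and fan-out 4 the two values are 21 and 3.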


4.4 Limitations and Discussion 
We now discuss some of the limitations of our anticipa- 
tory approach. 

Not all pieces are yet automated; instead, the tool currently
helps turn the intractable problem of examining
180,000 lines of code into a tractable one, providing a
lot of assistance in finding the correct pre-allocations.
Further work is required in slicing and backtracking to
streamline this process, but that is not the focus of our current
effort; rather, our goal here is to demonstrate the feasibility
of the anticipatory approach.

The anticipatory approach could fail requests in cases
where normal execution would successfully complete.
Normal execution need not always take the worst-case (or
longest) path. As a result, it might be able to complete
with fewer memory allocations than the anticipatory
approach. In contrast, the anticipatory approach always has to
allocate memory for the worst-case scenario, as it cannot
afford to fail on a memory-allocation call after the
pre-allocation phase.

Cache peeking can only be used when sufficient information
is available at the time of allocation to determine
if the required data is in the cache. Sufficient information
is available for file systems at the beginning of a system
call in the context of file/directory reads and lookup
of file-system objects; this allows cache peeking to avoid
pre-allocation with little implementation effort. More
implementation effort could be required in other systems to
help determine if the required data is in the cache.


5 The AMA Run-Time 


The final piece of AMA is the runtime component. There 
are two major pieces to consider. First is the pre- 
allocation itself, which is inserted at every relevant en- 
try point in the kernel subsystem of interest, and subse- 
quent cleanup of pre-allocated memory. Second is the 
use of the pre-allocated memory, in which the run-time 
must transparently redirect allocation requests (such as 
kmalloc()) to use the pre-allocated memory. We dis- 
cuss these in turn, and then present the other run-time de- 
cision a file system such as Linux ext2-mfr must make: 
what to do when a pre-allocation request fails? 


5.1 Pre-allocating and Freeing Memory 

To add pre-allocation to a specific file system, we require
the file system to implement a single new VFS-level
call, which we call vfs_get_mem_requirements().
This call takes as arguments information about which call
is about to be made, any relevant arguments about the current
operation (such as the file position or bytes to be read)
and state of the file system, and then returns a structure to
the caller (in this case, the VFS layer) which describes
all of the necessary allocations that must take place. The
structure is referred to as the anticipatory allocation
description (AAD).

The VFS layer unpacks the AAD, allocates memory
chunks (perhaps using different allocators) as need be, and
links them into the task structure of the calling process
for downstream use (described further below). With the
pre-allocated memory in place, the VFS layer then calls
the desired routine (such as vfs_read()), which then
utilizes the pre-allocated memory during its execution.
When the operation completes, a generic AMA cleanup
routine is called to free any unused memory.

loff_t pos = file_pos_read(file);
AMA_CHECK_AND_ALLOCATE(file,
                       AMA_SYS_READ, pos, count);
ret = vfs_read(file, buf, count, &pos);
file_pos_write(file, pos);
AMA_CLEANUP();

Figure 7: A VFS Read Example.

To give a better sense of this code flow, we provide a 
simplified example from the read() system call code 
path in Figure 7. Without the AMA additions, the code 
simply looks up the current file position (i.e., where to 
read from next), calls into vfs_read() to do the file- 
system-specific read, updates the file offset, and returns. 
As described in the original VFS paper [23], this code is 
generic across all file systems. 

With AMA, two extra steps are required, as shown in
the figure. First, before calling into the vfs_read()
call, the VFS layer now checks if the underlying file
system is using AMA, and if so, calls the file system's
vfs_get_mem_requirements() routine to determine
the pending call's memory requirements, and finally
allocates the needed memory. All of this work is neatly
encapsulated by the AMA_CHECK_AND_ALLOCATE()
call in the figure.

Second, after the call is complete, a cleanup routine
AMA_CLEANUP() is called. This call is required because
the AMAlyzer provides us with a worst-case estimate of
possible memory usage, and hence not all pre-allocated
memory is used during the course of a typical call into the
file system. In order to free this unused memory, the extra
call to AMA_CLEANUP() is made.


5.2 Using Pre-allocated Memory 

Central to our implementation is transparency; we do not 
change the specific file system (ext2) or other kernel code 
to explicitly use or free pre-allocated memory. File sys- 
tems and the rest of the kernel thus continue to use regular 
memory-allocation routines. 

To support this transparency, we modified each of the
kernel allocation routines as follows. When
a process calls into ext2-mfr, the pre-allocation code (in
AMA_CHECK_AND_ALLOCATE() above) sets a new flag
within the per-task task structure. This anticipatory flag
is then checked upon each entry into any kernel
memory-allocation routine. If the flag is set, the routine attempts
to use pre-allocated memory and, if it can, completes by
returning one of the pre-allocated chunks; if the flag is not
set, the normal allocation code is executed (and failure
is a possibility). Calls to kfree() and other
memory-releasing routines operate as normal, and thus we leave
those unchanged.

Allocation requests are matched with the pre-allocated
objects using the parameters passed to the allocation call
at runtime. The parameters passed to the allocation call
are the size, order, or cachep pointer, and the GFP flag. The
type of the desired memory object is inferred through the
invocation of the allocation call at runtime. The size (for
kmalloc and vmalloc) or order (for alloc_pages) helps to
exactly match the allocation request with the pre-allocated
object. For cache objects, the cachep pointer helps identify
the correct pre-allocated object.

One small complication arises during interrupt han- 
dling. Specifically, we do not wish to redirect memory 
allocation requests to use pre-allocated memory when re- 
quested by interrupt-handling code. Thus, when inter- 
rupted, we take care to save the anticipatory flag of the 
currently-running process and restore it when the inter- 
rupt handling is complete. 
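
The flag check and the interrupt-time save and restore can be sketched as follows. This is a user-space miniature under stated assumptions: struct task, alloc_hook(), and alloc_in_interrupt() are hypothetical, the pool holds a single chunk for brevity, and falling back to the normal path when no chunk matches is a choice made for the sketch rather than documented behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the per-task state described above. */
struct task {
    bool   anticipatory;            /* set while inside an AMA-enabled call */
    size_t chunk_size;              /* one pre-allocated chunk, for brevity */
    void  *chunk;
};

static char normal_heap[64];        /* stands in for the real allocator */

/* Allocation entry point: consult the flag before the normal path. */
void *alloc_hook(struct task *t, size_t size)
{
    assert(t != NULL);
    if (t->anticipatory && t->chunk != NULL && t->chunk_size == size) {
        void *m = t->chunk;
        t->chunk = NULL;            /* chunk is now owned by the caller */
        return m;
    }
    return size <= sizeof(normal_heap) ? normal_heap : NULL;
}

/* Interrupt path: save the flag, clear it, and restore it afterwards. */
void *alloc_in_interrupt(struct task *t, size_t size)
{
    bool saved = t->anticipatory;
    t->anticipatory = false;        /* handlers must not consume the pool */
    void *m = alloc_hook(t, size);
    t->anticipatory = saved;
    return m;
}
```

An allocation made from the interrupt path leaves the task's pre-allocated chunk untouched, so the interrupted call still finds it afterwards.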


5.3 What If Pre-Allocation Fails?

Adding the pre-allocation into the code raises a new pol- 
icy question: how should the code handle the failure of 
the pre-allocation itself? We believe there are a number of 
different policy alternatives, which we now describe: 


• Fail-immediate. This policy immediately returns an
error to the caller (such as ENOMEM).

• Retry-forever (with back-off). This policy simply
keeps retrying forever, perhaps inserting a delay of
some kind (e.g., exponential) between retry requests
to reduce the load on the system and better control
the load on the memory system.

• Retry-alternate (with back-off). This form of retry
also requests memory again, but uses an alternate
code path that uses less memory than the original
through page/memory recycling and thus is more
likely to succeed. This retry can also back off as
need be.
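
Because the pre-allocation happens in one place, the first two policies fit in a single small routine. The sketch below is a user-space illustration: prealloc_with_policy(), the callback types, the 1024 ms cap, and the demo helpers flaky_alloc() and record_delay() are all hypothetical, and the retry-alternate policy is omitted because it depends on a second, file-system-specific code path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

enum ama_policy { FAIL_IMMEDIATE, RETRY_FOREVER };

/* Hypothetical hooks: an allocator to call and a delay function. */
typedef void *(*alloc_fn)(size_t);
typedef void (*delay_fn)(unsigned int ms);

/* Top-level pre-allocation with one unified failure policy.
 * Returns 0 and stores the memory in *out, or returns -ENOMEM. */
int prealloc_with_policy(enum ama_policy policy, size_t size,
                         alloc_fn alloc, delay_fn delay, void **out)
{
    unsigned int backoff_ms = 1;
    assert(alloc != NULL && out != NULL);
    for (;;) {
        *out = alloc(size);
        if (*out != NULL)
            return 0;
        if (policy == FAIL_IMMEDIATE)
            return -ENOMEM;
        if (delay != NULL)
            delay(backoff_ms);          /* wait before the next attempt */
        if (backoff_ms < 1024)
            backoff_ms *= 2;            /* exponential back-off, capped */
    }
}

/* Demo helpers: an allocator that fails a set number of times, and a
 * delay function that only records how long it was asked to wait. */
static int fail_left = 0;
static unsigned int total_delay_ms = 0;

static void *flaky_alloc(size_t size)
{
    if (fail_left > 0) {
        fail_left--;
        return NULL;
    }
    return malloc(size);
}

static void record_delay(unsigned int ms)
{
    total_delay_ms += ms;
}
```

With three injected failures, the retry-forever policy waits 1, 2, and 4 ms before succeeding on the fourth attempt; the fail-immediate policy returns -ENOMEM on the first failure.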


Using AMA to implement these policies is superior 
to the existing approach, as it enables shallow recovery, 
immediately upon entry into the subsystem. For exam- 
ple, consider the fail-immediate option above. Clearly 
this policy could be implemented in the traditional system 
without AMA, but in our opinion doing so is prohibitively 
complex. To do so, one would have to ensure that the fail- 
ure was propagated correctly all the way through the many 
layers of the file system code, which is difficult [19,34]. 
Further, any locks acquired or other state changes made 
would have to be undone. Deep recovery is difficult and 
error-prone; shallow recovery is the opposite. 

Another benefit that the shallow recovery of AMA permits
is a unified policy. The policy, whether failing immediately,
retrying, or some combination, is specified in one
or a few places in the code. Thus, the developer can easily
decide how the system should handle such a failure and be
confident that the implementation meets that desire.

              Process State   | File-System State
              Error    Abort  | Unusable  Inconsistent
ext2-mfr_10     0        0    |    0          0
ext2-mfr_50     0        0    |    0          0
ext2-mfr_99     0        0    |    0          0

Table 4: Fault Injection Results: Retry. The table shows
the reaction of the Linux ext2-mfr file system to memory failures as the
probability of a failure increases. The file system uses a “retry-forever”
policy to handle each failure. A detailed description of the experiment is
found in Table 2.

A third benefit of our approach: file systems could 
expose some control over the policy to applications. 
Whereas most applications may not be prepared to han- 
dle such a failure, a more savvy application (such as a file 
server or database) could set the file system to fail-fast and 
thus enable better control over failure handling. 

Pre-allocation failure is not a panacea, however. De- 
pending on the installation and environment, the code 
that handles pre-allocation failures will possibly run quite 
rarely, and thus may not be as robust as normal-case code. 
Although we believe this to be less of a concern for pre- 
allocation recovery code (because it is small, simple, and 
usually correct “by inspection”), further efforts could be 
applied to harden this code. For example, some have sug- 
gested constant “fire drilling” [9] as a way to ensure oper- 
ators are prepared to handle failures; similarly, one could 
regularly fail kernel subsystems (such as memory alloca- 
tors) to ensure that this recovery code is run. 


6 Analysis 


We now analyze Linux ext2-mfr. We measure its ro- 
bustness under memory-allocation failure, as well as its 
baseline performance. We further study its space over- 
heads, exploring cases where our estimates of memory- 
allocation needs could be overly conservative, and 
whether the optimizations introduced earlier are effective 
in reducing these overheads. All experiments were per- 
formed on a 2.2 GHz Opteron processor, with two 80GB 
WDC disks, 2GB of memory, running Linux 2.6.32. We 
also experimented with the ramfs file system and were 
able to get similar performance results and better space 
overheads (not shown in the evaluation results). 


6.1 Robustness 

Our first experiment with ext2-mfr reprises our earlier
fault injection study found in Table 2. In this experiment,
we vary the probability that the memory-allocation routines
will fail from 10% all the way to 99%, and observe how
ext2-mfr behaves, both in terms of how processes are affected
and in terms of the overall file-system state. For this
experiment, the retry-forever (without any back-off) policy
is used. Table 4 reports our results.


                      ext2        ext2-mfr
Workload              (secs)      (secs)
Sequential Write       13.46       13.69 (1.02x)
Sequential Read         9.04        9.05 (1.01x)
Random Writes          11.58       11.67 (1.01x)
Random Reads          146.33      151.03 (1.03x)
Sort                  129.64      136.50 (1.05x)
OpenSSH                48.30       49.80 (1.03x)
PostMark               55.90       59.60 (1.07x)


Table 5: Baseline Performance. The baseline performance of
ext2 and ext2-mfr are compared. The first four tests are
microbenchmarks: sequential read and write either read or write a
1-GB file in its entirety; random read and write read or write
100 MB of data over a 1-GB file. Note that random-write performance
is good because the writes are buffered and thus can be scheduled
when written to disk. The three application-level benchmarks are:
a command-line sort of a 100 MB text file; the OpenSSH benchmark,
which copies, untars, configures, and builds the OpenSSH 4.5.1
source code; and the PostMark benchmark run for 60,000 transactions
over 3000 files (from 4KB to 4MB) with 50/50 read/append and
create/delete biases. All times are reported in seconds, and are
stable across repeated runs.


As Table 4 shows, ext2-mfr is highly robust
to memory allocation failure. Even when 99 out of 100
memory-allocation calls fail, ext2-mfr is able to retry and
eventually make progress. No application notices that the
failures are occurring, and file system usability and state
remain intact.
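A rough way to see why retry-forever still makes progress at a 99% failure rate: under the simplifying assumption (ours, not part of the experiment description) that each injected failure is independent with probability $p$, the number of attempts $A$ needed for one allocation to succeed is geometrically distributed:

$$\Pr[A \le n] = 1 - p^{\,n}, \qquad \mathbb{E}[A] = \frac{1}{1-p}.$$

For $p = 0.99$ this gives an expected 100 attempts per successful allocation, and for $p = 0.10$ about 1.11. As long as $p < 1$, success is delayed but not prevented.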


6.2 Performance 

In our next experiment, we study the performance over- 
heads of using AMA. We utilize both simple microbench- 
marks as well as application-level tests to gauge the over- 
heads incurred in ext2-mfr due to the extra work of mem- 
ory pre-allocation and cleanup. Table 5 presents the re- 
sults of our study. 

From the table, we can see that the performance of our 
relatively-untuned prototype is excellent across both mi- 
crobenchmarks as well as application-level workloads. In 
all cases, the extra work done by the AMA runtime to 
pre-allocate memory, redirect allocation requests trans- 
parently, and subsequently free unused memory has a 
minimal cost. With further streamlining, we feel confi- 
dent that the overheads could be reduced even further. 


6.3 Space Overheads and Cache Peeking 

We now study the space overheads of ext2-mfr, both with
and without our cache-peeking optimization. The largest
concern we have about conservative pre-allocation is that
excess memory may be allocated and then freed; although
we have shown there is little time overhead involved
(Table 5), the extra space requested could induce further
memory pressure on the system, (ironically) making
allocation failure more likely to occur. We run the same
set of microbenchmarks and application-level workloads,
and record information about how much memory was
allocated for both ext2 and ext2-mfr; we also turn on and off
cache-peeking for ext2-mfr. Table 6 presents our results.


                     ext2      ext2-mfr           ext2-mfr (+peek)
Workload             (GB)      (GB)               (GB)
Sequential Read      1.00      6.89 (6.87x)       1.00 (1.00x)
Sequential Write     1.01      1.01 (1.00x)       1.01 (1.00x)
Random Read          0.26      0.63 (2.41x)       0.28 (1.08x)
Random Write         0.10      0.10 (1.05x)       0.10 (1.00x)
PostMark             3.15      5.88 (1.87x)       3.28 (1.04x)
Sort                 0.10      0.10 (1.00x)       0.10 (1.00x)
OpenSSH              0.02      1.56 (63.29x)      0.07 (3.50x)


Table 6: Space Overheads. The total amount of memory
allocated for both ext2 and ext2-mfr is shown. The workloads are
identical to those described in the caption of Table 5.

From the table, we make a number of observations. 
First, our unoptimized ext2-mfr does indeed conserva- 
tively pre-allocate a noticeable amount more memory than 
needed in some cases. For example, during a sequen- 
tial read of a 1 GB file, normal ext2 allocates roughly 
1 GB (mostly to hold the data pages), whereas unopti- 
mized ext2-mfr allocates nearly seven times that amount. 
The file is being read one 4-KB block at a time, which 
means on average, the normal scan allocates one block 
per read whereas ext2-mfr allocates seven. The reason 
for these excess pre-allocations is simple: when reading 
a block from a large file, it is possible that one would 
have to read in a double-indirect block, indirect block, and 
so forth. However, as those blocks are already in cache 
for these reads, the conservative pre-allocation performs a 
great deal of unnecessary work, allocating space for these 
blocks and then freeing them immediately after each read 
completes; the excess pages are not needed. 

With cache peeking enabled, the pre-allocation space 
overheads improve significantly, as virtually all blocks 
that are in cache need not be allocated. Cache peek- 
ing clearly makes the pre-allocation quite space-effective. 
The only workload that does not approach the minimum
is OpenSSH. OpenSSH, however, places little demand on
the memory system in general and hence is not of great
concern.
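The idea behind cache peeking can be illustrated with a minimal sketch. The function name, the block identifiers, and the use of a plain set for the cache are all hypothetical; the real mechanism lives inside the kernel and is not reproduced here.

```python
def pages_to_preallocate(worst_case_blocks, cached_blocks, peek=True):
    """Return how many pages to reserve before an operation starts.

    worst_case_blocks: block ids the operation might have to read
                       (the data block plus any indirect blocks).
    cached_blocks:     set of block ids already resident in the cache.
    peek:              if False, reserve for the full worst case.
    """
    if not peek:
        return len(worst_case_blocks)
    # With peeking, only blocks that are NOT already cached need a page.
    return sum(1 for b in worst_case_blocks if b not in cached_blocks)


# Hypothetical read that may touch a data block and two levels of indirection.
blocks = ["data:42", "indirect:3", "double-indirect:0"]
cache = {"indirect:3", "double-indirect:0"}
print(pages_to_preallocate(blocks, cache, peek=False))  # full worst case
print(pages_to_preallocate(blocks, cache, peek=True))   # only the uncached block
```

When the indirect blocks are already cached, as in the sequential-read case above, the reservation shrinks to the pages that will actually be used.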


6.4 Page Recycling

We also study the benefits of page recycling. In this
experiment, we investigate the memory overheads that arise
during truncate. Figure 8 plots the results.

In the figure, we compare the space overheads of standard
ext2, ext2-mfr (without cache peeking), and ext2-mfr with
page recycling. As one can see from the figure, as the file
grows, the space overheads of both ext2 and ext2-mfr converge,
as numerous pages are allocated for indirect blocks. Page
recycling obviates the need for these blocks, and thus uses
many fewer pages than even standard ext2.


Figure 8: Space Costs with Page Recycling. The figure
shows the measured space overheads of page recycling during
the truncate of a file. The file size is varied along the x-axis, and
the space cost is plotted on the y-axis (both are log scales).


              Process State      File-System State
              Error    Abort     Unusable    Inconsistent
ext2-mfr10     15        0          0             0
ext2-mfr50     15        0          0             0
ext2-mfr99     15        0          0             0


Table 7: Fault Injection Results: Fail-Fast. The table
shows the reaction of the Linux ext2-mfr file system using a
fail-fast policy. A detailed description of the experiment is
found in Table 2.


6.5 Conservative Pre-allocation 


We also were interested in whether, despite our best ef- 
forts, ext2-mfr ever under-allocated memory in the pre- 
allocation phase. Thus, we ran our same set of work- 
loads and checked for this case. In no run during these 
experiments and other stress-tests did we ever encounter 
an under-allocation, giving us further confidence that our 
static transformation of ext2 was properly done. 


6.6 Policy Alternatives 


We also were interested in seeing how hard it is to use 
a different policy to react to allocation failures. Table 7 
shows the results of our fault-injection experiment, but 
this time with a “fail-fast” policy which immediately re- 
turns to the user should the pre-allocation attempt fail. 
The results show the expected outcome. In this case, 
the process running the workload immediately returns the 
ENOMEM error code; the file system remains consistent 
and usable. By changing only a few lines of code, an entirely
different failure-handling behavior can be realized.


7 Related Work


A large body of related work is found in the programming 
languages community on heap usage analysis, wherein 
researchers have developed static analyses to determine 
how much heap (or stack) space a program will use [1, 8, 
11, 12,21,22,35,36]. The general use-case suggested for 
said analyses is in the embedded domain, where memory 
and time resources are generally quite constrained [11]. 
Whereas many of the analyses focus on functional or 
garbage-collected languages, and thus are not directly ap- 
plicable to our problem domain, we do believe that some 
of the more recent work in this space could be applicable 
to anticipatory memory allocation. In particular, Chin et 
al.’s work on analyzing “low-level” code [11] and the live 
heap analysis implemented by Albert et al. [1] are promis- 
ing candidates for further automating the AMA transfor- 
mation process. 

The more general problem of handling “memory bugs” 
has also been investigated in great detail [2,5, 14,32, 33]; 
see Berger and Zorn for an excellent discussion of the 
range of common problems, including dangling pointers, 
double frees, and buffer overruns [5]. Many interesting 
and novel solutions have been proposed, including rolling 
back and trying again with a small change to the envi- 
ronment (e.g., more padding) [32], using multiple ran- 
domized heaps and voting to determine correctness [5], 
and even returning “made up” values when out-of-bounds 
memory is accessed [33]. The problem we tackle is both 
narrower and broader at once: narrower in that one could 
view the poor handling of an allocation failure as just one 
class of memory bug; broader in that true recovery from 
such a failure in a complex code base is quite intricate 
and reaches beyond the scope of typical solutions to these 
classic memory bugs. 

Our approach of using static analysis to predict memory
requirements is similar in spirit to that taken by
Garbervetsky et al. [18]. Their approach helps to come
up with estimates of memory allocation within a given
region. Moreover, their system does not consider the
allocations done by native methods or internal allocation
performed by the runtime system, and does not handle
recursive calls. In contrast, AMA comes up with an estimate
for the entire file-system operation. Also, AMA estimates
the allocations done by the kernel and handles recursive
calls inside file systems.

Our approach to avoiding memory-allocation failure 
is reminiscent of the banker’s algorithm [15] and other 
deadlock-avoidance techniques. Indeed, with AMA, one
could build a sort of “memory scheduler” that avoids
memory over-commitment by delaying some requests until
other frees have taken place, another avenue we plan to
explore in future work.

Finally, our approach draws on concurrency control in
its resemblance to two-phase locking [30], in which all
locks are first acquired in an “expanding phase”, then
used, and then all released during a “shrinking phase”.
The expanding phase thus bears likeness to our
pre-allocation request, in that all necessary resources are
acquired up front before they are needed.


8 Conclusions 


“Act as if it were impossible to fail.” (Dorothea Brande)


It is common sense in the world of programming that 
code that is rarely run rarely works. Unfortunately, some 
of the most important code in systems falls into this cate- 
gory, including any code that is run during a “recovery”. 
If the problem that leads to the recovery code being en- 
acted is rare enough, the recovery code itself is unlikely 
to be battle tested, and is thus prone to failure. 


We have presented Anticipatory Memory Allocation 
(AMA), a new approach to avoiding memory-allocation 
failures deep within the kernel. By pre-allocating the 
worst-case allocation immediately upon entry into the ker- 
nel, AMA ensures that requests further downstream will 
never fail, in those places within the code where handling 
failure has proven difficult over the years. The small bits 
of recovery code that are scattered throughout the code 
need never run, and system robustness is improved by de- 
sign. 

As we build increasingly complex systems, perhaps 
we should consider new methods and approaches that 
help build robustness into the system by design. AMA 
presents one method (early resource allocation) to handle 
one problem (memory-allocation failure), but we believe 
that the approach could be applied more generally. Our 
long-term goal is to unify mainline code and recovery code
into one; put another way, the only true manner in which
to have working recovery code is to have none at all.


Abstract


This paper advocates a device-aware design strategy to
improve various NAND flash memory system performance
metrics. It is well known that NAND flash memory
program/erase (PE) cycling gradually degrades memory device
raw storage reliability, and sufficiently strong error
correction codes (ECC) must be used to ensure the PE cycling
endurance. Hence, memory manufacturers must fabricate a
sufficient number of redundant memory cells geared to the
worst-case device reliability at the end of memory lifetime.
Given the memory device wear-out dynamics, the existing
worst-case oriented ECC redundancy is largely under-utilized
over the entire memory lifetime, and can be adaptively traded
for improving certain NAND flash memory system performance
metrics. This paper explores such device-aware adaptive
system design space from two perspectives: (1) how to improve
memory program speed, and (2) how to improve memory defect
tolerance and hence enable aggressive fabrication technology
scaling. To enable quantitative evaluation, we for the first
time develop a NAND flash memory device model to capture the
effects of PE cycling from the system level. We carry out
simulations using the DiskSim-based SSD simulator and a
variety of traces, and the results demonstrate up to 32% SSD
average response time reduction. We further demonstrate the
potential for achieving very good defect tolerance, and
finally show that these two design approaches can be readily
combined to noticeably improve SSD average response time even
in the presence of high memory defect rates.


1 Introduction 


The steady bit cost reduction over the past decade has
enabled NAND flash memory to enter increasingly diverse
applications, from consumer electronics to personal and
enterprise computing. In particular, it is now economically
viable to implement solid-state drives (SSDs) using
NAND flash memory, which is expected to fundamentally
change the memory and storage hierarchy in future
computing systems. As the semiconductor industry is
aggressively pushing NAND flash memory technology
scaling and the use of multi-bit-per-cell storage schemes,
NAND flash memory increasingly relies on error correction
codes (ECC) to ensure data storage integrity. It
is well known that NAND flash memory cells gradually
wear out with program/erase (PE) cycling [6], which
is reflected as gradually diminishing memory cell storage
noise margin (or increasing raw storage bit error rate). To
meet a specified PE cycling endurance limit, NAND flash
memory manufacturers must fabricate a sufficient number
of redundant memory cells that can tolerate the worst-case
raw storage reliability at the end of memory lifetime.
Clearly, the memory cell wear-out dynamics tend
to make the existing worst-case oriented ECC redundancy
largely under-utilized over the entire lifetime of
memory, especially at its early lifetime when the PE cycling
number is relatively small.


*This material is based upon work supported by the National
Science Foundation under Grant No. 0937794


Very intuitively, we may adaptively trade such under- 
utilized ECC redundancy for improving certain NAND 
flash memory system performance metrics throughout 
the memory lifetime. This naturally leads to a PE- 
cycling-aware adaptive NAND flash memory system de- 
sign paradigm. Based upon extensive open literature on 
flash memory devices, we first develop an approximate 
NAND flash memory device model that quantitatively 
captures the dynamic PE cycling effects, including ran- 
dom telegraph noise [15, 17] and interface trap recovery 
and electron detrapping [26, 31, 45], and another major 
noise source: cell-to-cell interference [25]. Such a device
model makes it possible to explore and quantitatively
evaluate possible adaptive system design strategies. In
particular, this paper explores the adaptive system design
space from two perspectives: (1) Since NAND flash
memory program speed also strongly affects the memory
cell storage noise margin, we could trade the under-utilized
ECC redundancy to adaptively improve NAND
flash memory program speed; (2) We could also exploit
the under-utilized ECC redundancy to realize stronger
memory cell defect tolerance and hence enable more
aggressive technology scaling. We elaborate on the underlying
rationale and realizations of these two design approaches.
In addition, for the latter one, we propose a
simple differential wear-leveling strategy in order to
minimize its impact on effective PE cycling endurance.

For the purpose of evaluation, using the developed 
NAND flash memory device model, we first obtain de- 
tailed quantitative memory cell characteristics under dif- 
ferent PE cycling times and different program speed for 
a hypothetical 2bit/cell NAND flash memory. Accord- 
ingly, with the sector size of 512B user data, we construct 
a binary BCH code (4798, 4096, 54) with 17.1% coding
redundancy that can ensure the data storage integrity at
the PE cycling limit of 10K. Using representative workload
traces and the SSD model [3] in DiskSim [8], we
carry out extensive simulations to evaluate the potential 
of trading under-utilized ECC redundancy to improve 
memory program speed while assuming the memory is 
defect-free. The simulation results show that we could 
reduce the SSD average response time by up to 32%. As- 
suming memory defects follow Poisson distributions, we 
further show that the proposed differential wear-leveling 
technique can very effectively improve the effectiveness 
of allocating ECC redundancy for improving memory 
defect tolerance. Finally, we study the combined effects 
when we trade the under-utilized ECC redundancy to im- 
prove memory program speed and realize defect toler- 
ance at the same time. DiskSim-based simulations show 
that, even in the presence of high defect rates, we can still 
achieve noticeable SSD average response time reduction. 
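As a quick sanity check on the code parameters quoted above, the short sketch below works out the parity overhead of an (n, k, t) binary BCH code. The helper name `bch_overhead` is ours; the only facts relied on are that a binary BCH code of length at most 2^m - 1 correcting t errors needs at most m·t parity bits, and that 512 B of user data is 4096 bits.

```python
import math


def bch_overhead(n, k, t):
    """Summarize an (n, k, t) binary BCH code: n codeword bits, k user bits,
    t correctable bit errors."""
    parity = n - k
    m = math.ceil(math.log2(n + 1))      # smallest m with n <= 2**m - 1
    return {
        "parity_bits": parity,
        "field_degree_m": m,
        "parity_bound_m_times_t": m * t,  # BCH bound: n - k <= m * t
        "parity_over_user": parity / k,
        "parity_over_codeword": parity / n,
    }


info = bch_overhead(4798, 4096, 54)
print(info)
```

For (4798, 4096, 54) the parity is 702 bits, which equals 13 × 54, i.e., about 17.1% of the 4096 user bits (or about 14.6% of the codeword).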


2 Background 


2.1 Memory Erase and Program Basics 


Each NAND flash memory cell is a floating gate transistor
whose threshold voltage can be configured (or programmed)
by injecting a certain amount of charge into the
floating gate. Hence, data storage in NAND flash memory
is realized by programming the threshold voltage
of each memory cell into two or more non-overlapping
voltage windows. Before a memory cell can be programmed,
it must be erased (i.e., the charge in the floating gate
is removed, which sets its threshold voltage to
the lowest voltage window). NAND flash memory uses
Fowler-Nordheim (FN) tunneling to realize both erase
and program [7], because FN tunneling requires very
low current and hence enables high erase/program
parallelism. It is well known that the threshold voltage of
erased memory cells tends to have a wide Gaussian-like
distribution [41]. Hence, we can approximately model
the threshold voltage distribution of the erased state as

$$p_e(x) = \frac{1}{\sigma_e\sqrt{2\pi}}\, e^{-\frac{(x-\mu_e)^2}{2\sigma_e^2}}, \tag{1}$$

where $\mu_e$ and $\sigma_e$ are the mean and standard deviation
of the erased state threshold voltage. Regarding memory
program, a tight threshold voltage control is typically
realized by using incremental step pulse program
(ISPP) [6, 39], i.e., all the memory cells on the same
word-line are recursively programmed using a program-and-verify
approach with a staircase program word-line
voltage $V_{pp}$. Let $\Delta V_{pp}$ denote the incremental program
step voltage. For the $k$-th programmed state with the verify
voltage $V_p^{(k)}$, ideally ISPP program results in a uniform
threshold voltage distribution:

$$p_p^{(k)}(x) = \begin{cases} \dfrac{1}{\Delta V_{pp}}, & \text{if } V_p^{(k)} \le x \le V_p^{(k)} + \Delta V_{pp} \\ 0, & \text{else} \end{cases} \tag{2}$$

Unfortunately, the above ideal memory cell thresh- 
old voltage distribution can be (significantly) distorted 
in practice, mainly due to PE cycling and cell-to-cell in- 
terference, which will be discussed in the remainder of 
this section. 
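The two ideal distributions of Eq. (1) and Eq. (2) are easy to sample, which is the starting point for any Monte Carlo study of the distortions discussed next. The sketch below is a minimal illustration using Python's standard `random` module; the function names are ours, and the numeric values in the usage lines are the normalized settings the paper gives later in Example 2.1.

```python
import random


def sample_erased(mu_e, sigma_e, rng=random):
    """Erased state: Gaussian threshold voltage, as in Eq. (1)."""
    return rng.gauss(mu_e, sigma_e)


def sample_programmed(v_p, dv_pp, rng=random):
    """k-th programmed state under ideal ISPP: uniform on
    [v_p, v_p + dv_pp], as in Eq. (2)."""
    return rng.uniform(v_p, v_p + dv_pp)


# Normalized values from Example 2.1: erased state (1.4, 0.35),
# step voltage 0.3, verify voltages 2.85, 3.55 and 4.25.
rng = random.Random(0)
erased = [sample_erased(1.4, 0.35, rng) for _ in range(5)]
state1 = [sample_programmed(2.85, 0.3, rng) for _ in range(5)]
print(erased)
print(state1)
```

Note that the width of each programmed state equals the step voltage, which is why a larger step is faster but leaves less margin between adjacent states.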


2.2 Effects of PE Cycling 


Flash memory PE cycling causes damage to the tunnel 
oxide of floating gate transistors in the form of charge 
trapping in the oxide and interface states [9, 30, 34], 
which directly results in threshold voltage shift and fluc- 
tuation and hence gradually degrades memory device 
noise margin. Major distortion sources include 


1. Electrons capture and emission events at charge 
trap sites near the interface developed over PE cy- 
cling directly result in memory cell threshold volt- 
age fluctuation, which is referred to as random tele- 
graph noise (RTN) [15, 17]; 


2. Interface trap recovery and electron detrapping [26, 
31,45] gradually reduce memory cell threshold volt- 
age, leading to the data retention limitation. 


Moreover, electrons trapped in the oxide over PE cycling
make it difficult to erase the memory cells, leading
to a longer erase time, or equivalently, under the same
erase time, those trapped electrons make the threshold
voltage of the erased state increase [4,21,27,42]. Most
commercial flash chips employ erase-and-verify operation
to prevent the increase of erase state threshold voltage
at the penalty of gradually longer erase time with PE
cycling.


Figure 1: Illustration of the approximate NAND flash memory
device model to incorporate major threshold voltage
distortion sources.


RTN causes random fluctuation of memory cell
threshold voltage, where the fluctuation magnitude is
subject to exponential decay. Hence, we can model
the probability density function $p_r(x)$ of RTN-induced
threshold voltage fluctuation as a symmetric exponential
function [15]:

$$p_r(x) = \frac{1}{2\lambda_r}\, e^{-\frac{|x|}{\lambda_r}}. \tag{3}$$

Let $N$ denote the PE cycling number; $\lambda_r$ scales with $N$
in an approximate power-law fashion, i.e., $\lambda_r$ is
approximately proportional to $N^{\alpha}$, where $\alpha$ tends to be
less than 1.

Interface trap recovery and electron detrapping processes
approximately follow Poisson statistics [30],
hence threshold voltage reduction due to interface trap
recovery and electron detrapping can be approximately
modeled as a Gaussian distribution $\mathcal{N}(\mu_d, \sigma_d^2)$. Both $\mu_d$
and $\sigma_d^2$ scale with $N$ in an approximate power-law fashion,
and scale with the retention time $t$ in a logarithmic
fashion. Moreover, the significance of threshold voltage
reduction induced by interface trap recovery and electron
detrapping is also proportional to the initial threshold
voltage magnitude [27], i.e., the higher the initial threshold
voltage is, the faster the interface trap recovery and
electron detrapping occur and hence the larger threshold
voltage reduction will be.


2.3 Cell-to-Cell Interference 


In NAND flash memory, the threshold voltage shift of
one floating gate transistor can influence the threshold
voltage of its neighboring floating gate transistors
through the parasitic capacitance-coupling effect [25]. This
is referred to as cell-to-cell interference, which has been
well recognized as one of the major noise sources in
NAND flash memory [24,29,36]. Threshold voltage shift
of a victim cell caused by cell-to-cell interference can be
estimated as [25]

$$F = \sum_k \left(\Delta V_t^{(k)} \cdot \gamma^{(k)}\right), \tag{4}$$

where $\Delta V_t^{(k)}$ represents the threshold voltage shift of one
interfering cell which is programmed after the victim
cell, and the coupling ratio $\gamma^{(k)}$ is defined as

$$\gamma^{(k)} = \frac{C^{(k)}}{C_{total}}, \tag{5}$$

where $C^{(k)}$ is the parasitic capacitance between the
interfering cell and the victim cell, and $C_{total}$ is the total
capacitance of the victim cell. Cell-to-cell interference
significance is affected by NAND flash memory bit-line
structure. In current design practice, there are two different
bit-line structures: the conventional even/odd
bit-line structure [35,40] and the emerging all-bit-line
structure [10,28]. For write, the all-bit-line structure writes
all the cells on the same word-line. In the even/odd bit-line
structure, memory cells on one word-line are alternately
connected to even and odd bit-lines and they are
programmed at different times. Therefore, an even cell is
mainly interfered by five neighboring cells and an odd
cell is interfered by only three neighboring cells. As a
result, even cells and odd cells experience largely different
amounts of cell-to-cell interference. Cells in the all-bit-line
structure suffer less cell-to-cell interference than even
cells in the even/odd structure, and the all-bit-line structure
can most effectively support high-speed current sensing
to improve the memory read and verify speed. Therefore,
throughout the remainder of this paper, we mainly consider
NAND flash memory with the all-bit-line structure.
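Eq. (4) and Eq. (5) amount to a weighted sum, sketched below. The function names are ours. In the usage lines, the coupling ratios 0.08 (vertical) and 0.0048 (diagonal) are the means the paper quotes later in Example 2.1, while the choice of one vertical and two diagonal interfering neighbors and the shift of 2.0 per neighbor are purely hypothetical inputs for illustration.

```python
def coupling_ratio(c_parasitic, c_total):
    """Eq. (5): ratio of the parasitic capacitance between interfering and
    victim cell to the victim cell's total capacitance."""
    return c_parasitic / c_total


def interference_shift(neighbor_shifts, coupling_ratios):
    """Eq. (4): F = sum_k dV_t(k) * gamma(k), taken over the interfering
    cells that are programmed after the victim cell."""
    if len(neighbor_shifts) != len(coupling_ratios):
        raise ValueError("need one coupling ratio per interfering cell")
    return sum(dv * g for dv, g in zip(neighbor_shifts, coupling_ratios))


# Hypothetical victim with one vertical and two diagonal neighbors,
# each of which is later programmed with a threshold shift of 2.0.
shift = interference_shift([2.0, 2.0, 2.0], [0.08, 0.0048, 0.0048])
print(shift)
```

With these inputs the vertical neighbor dominates the total shift, since its coupling ratio is more than an order of magnitude larger than the diagonal one.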


2.4 An Approximate NAND Flash Memory 
Device Model 


Based on the above discussions, we can approximately
model NAND flash memory device characteristics as
shown in Fig. 1. Accordingly, we can simulate memory
cell threshold voltage distribution and the corresponding
memory cell raw storage reliability. Based upon Eq.(1)
and Eq.(2), we can obtain the distortion-less threshold
voltage distribution function $p_p(x)$. Recall that $p_r(x)$
denotes the RTN distribution function (see Eq.(3)), and
let $p_{ar}(x)$ denote the threshold voltage distribution after
incorporating RTN, which is obtained by convolving
$p_p(x)$ and $p_r(x)$:

$$p_{ar}(x) = p_p(x) \otimes p_r(x). \tag{6}$$

Cell-to-cell interference is further incorporated based on
Eq.(4). To capture the inevitable process variability, we
set both the vertical coupling ratio $\gamma_y$ and diagonal
coupling ratio $\gamma_{xy}$ as random variables with bounded
Gaussian distributions:

$$p_c(x) = \begin{cases} \dfrac{c_c}{\sigma_c\sqrt{2\pi}}\, e^{-\frac{(x-\mu_c)^2}{2\sigma_c^2}}, & \text{if } |x - \mu_c| < w_c \\ 0, & \text{else} \end{cases} \tag{7}$$

where $\mu_c$ and $\sigma_c$ are the mean and standard deviation,
and $c_c$ is chosen to ensure the integration of this bounded
Gaussian distribution equals 1. We set $w_c = 0.1\mu_c$ and
$\sigma_c = 0.4\mu_c$ in this work.
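One simple way to draw from the bounded Gaussian of Eq. (7) is rejection sampling: draw from the unbounded Gaussian and discard values outside the window, so that the normalization constant never has to be computed explicitly. The sketch below assumes a window half-width of 0.1 times the mean and a standard deviation of 0.4 times the mean, as we read the settings above; the function name and keyword parameters are ours.

```python
import random


def sample_coupling_ratio(mu_c, rng=random, w_frac=0.1, sigma_frac=0.4):
    """Rejection-sample a Gaussian restricted to |x - mu_c| < w_c (Eq. (7)).

    w_c = w_frac * mu_c and sigma_c = sigma_frac * mu_c; accepted draws
    follow the bounded density, with the constant c_c implicit.
    """
    w_c = w_frac * mu_c
    sigma_c = sigma_frac * mu_c
    while True:
        x = rng.gauss(mu_c, sigma_c)
        if abs(x - mu_c) < w_c:
            return x


rng = random.Random(2)
print([sample_coupling_ratio(0.08, rng) for _ in range(3)])
```

With these fractions the window spans only a quarter of a standard deviation on each side, so roughly one draw in five is accepted; that is wasteful but perfectly adequate for a simulation of this size.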

Let $p_{ac}$ denote the threshold voltage distribution after
incorporating cell-to-cell interference, and let $p_t(x)$ denote
the distribution of threshold voltage fluctuation induced by
interface trap recovery and electron detrapping; the final
threshold voltage distribution $p_f$ is obtained as

$$p_f(x) = p_{ac}(x) \otimes p_t(x). \tag{8}$$


Example 2.1 Let us consider 2bits/cell NAND flash
memory. We set normalized $\sigma_e$ and $\mu_e$ of the erased state
as 0.35 and 1.4, respectively. For the three programmed
states, we set the normalized program step voltage $\Delta V_{pp}$
as 0.3, and the normalized verify voltages $V_p$ as 2.85,
3.55 and 4.25, respectively. For the RTN distribution
function $p_r(x)$, we set the parameter $\lambda_r = K_\lambda \cdot N^{0.5}$, where
$K_\lambda$ equals $4 \times 10^{-4}$. Regarding cell-to-cell interference,
according to [36, 35], we set the means of $\gamma_y$ and
$\gamma_{xy}$ as 0.08 and 0.0048, respectively. For the function
$\mathcal{N}(\mu_d, \sigma_d^2)$ to capture interface trap recovery and
electron detrapping, according to [30, 31], we set that $\mu_d$
scales with $N^{0.5}$ and $\sigma_d^2$ scales with $N^{0.6}$, and both scale
with $\ln(1 + t/t_0)$, where $t$ denotes the memory retention
time and $t_0$ is an initial time and can be set as 1 hour.
In addition, as pointed out earlier, both $\mu_d$ and $\sigma_d^2$ also
depend on the initial threshold voltage. Hence, we set
that both approximately scale with $K_s(x - x_0)$, where $x$ is
the initial threshold voltage, and $x_0$ and $K_s$ are constants.
Therefore, we have

$$\begin{cases} \mu_d = K_s (x - x_0)\, K_d\, N^{0.5} \ln(1 + t/t_0) \\ \sigma_d^2 = K_s (x - x_0)\, K_m\, N^{0.6} \ln(1 + t/t_0) \end{cases} \tag{9}$$

where we set $K_s = 0.333$, $x_0 = 1.4$, $K_d = 4 \times 10^{-4}$, and
$K_m = 2 \times 10^{-6}$ by fitting the measurement data presented
in [30,31]. Accordingly, we carry out Monte Carlo computer
simulations to obtain the cell threshold voltage distribution
as shown in Fig. 2, which illustrates how RTN,
cell-to-cell interference, and retention noise affect the
threshold voltage distribution.


Figure 2: Simulated results to show the effects of RTN,
cell-to-cell interference, and retention noise on memory
cell threshold voltage distribution.
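The retention-noise parameters of Eq. (9) can be evaluated directly. The sketch below is a minimal illustration, not the authors' simulator: the function name is ours, and the default constants are the fitted values as we read them from Example 2.1 (in particular, the value taken for the variance constant `k_m` is an assumption about a hard-to-read exponent).

```python
import math


def retention_params(x, n_cycles, t_hours, k_s=0.333, x0=1.4,
                     k_d=4e-4, k_m=2e-6, t0_hours=1.0):
    """Return (mu_d, sigma_d**2) of Eq. (9) for a cell whose initial
    normalized threshold voltage is x, after n_cycles PE cycles and
    t_hours of retention time."""
    common = k_s * (x - x0) * math.log(1.0 + t_hours / t0_hours)
    mu_d = common * k_d * n_cycles ** 0.5
    var_d = common * k_m * n_cycles ** 0.6
    return mu_d, var_d


# Highest programmed state (verify voltage 4.25) after 10K cycles,
# retained for 30 days (720 hours).
mu_d, var_d = retention_params(4.25, 10_000, 720.0)
print(mu_d, math.sqrt(var_d))
```

A cell sitting at the erased-state mean (x equal to x0) sees no retention shift under this model, while the highest programmed state is affected the most, matching the dependence on initial threshold voltage described above.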


3 System Design Adaptive to PE Cycling 


From the above discussions, it is clear that NAND flash 
memory cell raw storage reliability gradually degrades 
with the PE cycling: During the early lifetime of mem- 
ory cells (i.e., the PE cycling number JN is relatively 
small), the aggregated PE cycling effects are relatively 
small, which leads to a relatively large memory cell stor- 
age noise margin and hence good raw storage reliabil- 
ity (i.e., low raw storage bit error rate); since the ag- 
gregated PE cycling effects scale with N in approximate 
power-law fashions, the memory cell storage noise mar- 
gin and hence raw storage reliability gradually degrade 
as the PE cycling number JN increases. Given the target 
PE cycling endurance limit (e.g., LOK PE cycling), each 
memory word-line must have enough redundant mem- 
ory cells so that the corresponding ECC can ensure the 
storage integrity as the PE cycling reaches the endurance 
limit. Due to the memory cell raw storage reliability dy- 
namics, the redundancy geared to the worst-case scenario 
will over-protect the user data for most time throughout 
the entire memory lifetime, especially at its early life- 
time when memory cell operational noise margin is much 
larger. This can be illustrated in Fig. 3, which clearly 
suggests that the redundant memory cells are essentially 
under-utilized at the memory early lifetime. 

Intuitively, we may trade this under-utilized redundancy
to improve certain memory system performance metrics, and
the trade should adapt to the memory PE cycling. In this
work, we explore this adaptive memory system design space
from two perspectives, as discussed in the remainder of
this section.

Figure 3: Illustration of the under-utilized ECC redundancy before reaching PE cycling endurance limit.


3.1 Approach I: Improve Memory Program Speed


In this subsection, we elaborate on the potential of trading
the under-utilized ECC redundancy to improve average
memory program speed. As discussed in Section 2.1,
NAND flash memory program is carried out recursively
by sweeping over the entire memory cell threshold voltage
range with a program step voltage $\Delta V_{pp}$. As a result,
the memory program latency is inversely proportional to
$\Delta V_{pp}$, which suggests that we can improve the
memory program speed by increasing $\Delta V_{pp}$. However, a
larger $\Delta V_{pp}$ directly results in a wider threshold voltage
distribution of each programmed state, leading to less
noise margin between adjacent programmed states and
hence a worse raw storage bit error rate. Therefore, there
is an inherent trade-off between memory program speed
and memory raw bit error rate, which can be configured
by adjusting the program step voltage $\Delta V_{pp}$. Since the
memory cell noise margin is further degraded by the PE
cycling effects as discussed above, a given $\Delta V_{pp}$ will
result in different noise margins (hence different raw storage
bit error rates) as memory cells undergo different amounts
of PE cycling.

In current design practice, $\Delta V_{pp}$ is fixed and its value is
sufficiently small so that the ECC can tolerate the worst-case
memory raw storage bit error rate as the PE cycling
reaches its endurance limit. As a result, the memory
program speed remains largely unchanged while the raw
storage bit error rate gradually degrades. Before the PE
cycling number reaches its endurance limit, the existing
redundancy is under-utilized, as pointed out above.
Clearly, to eliminate such redundancy under-utilization,
we could intentionally increase the program step voltage
$\Delta V_{pp}$ according to the run-time PE cycling number in
such a way that the memory raw storage bit error rate is
always close to what can be maximally tolerated by the
existing redundancy. The existing redundancy is then
almost fully utilized at all times, and meanwhile the
dynamically increased $\Delta V_{pp}$ leads to higher average
memory program speed. The above discussion is further
illustrated in Fig. 4.

Although it would be ideal if the program step voltage
$\Delta V_{pp}$ could be smoothly adjusted with a very fine
granularity, the limited reference voltage accuracy in real
NAND flash memory chips may only enable the use of
a few discrete program step voltages. Assume there are
$m$ different program step voltages, i.e.,
$\Delta V_{pp}^{(1)} > \Delta V_{pp}^{(2)} > \cdots > \Delta V_{pp}^{(m)}$.
Given the existing ECC redundancy, we can obtain a sequence
of PE cycling thresholds $N_0 = 0 < N_1 < \cdots < N_m$ so that,
if the run-time PE cycling number falls into the range of
$[N_{i-1}, N_i)$, we can use the program step voltage
$\Delta V_{pp}^{(i)}$ and still ensure the overall system data
storage integrity. If we follow the conventional design
practice where the program step voltage is fixed according
to the worst-case scenario, the smallest step voltage
$\Delta V_{pp}^{(m)}$ will be used throughout the entire memory
lifetime. Therefore, we can estimate the average program
speed improvement over the entire memory lifetime as

$$s = 1 - \frac{\sum_{i=1}^{m} (N_i - N_{i-1}) \cdot \frac{1}{\Delta V_{pp}^{(i)}}}{N_m \cdot \frac{1}{\Delta V_{pp}^{(m)}}} \quad (10)$$
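
The threshold-based selection described above can be sketched as a simple table lookup. This is a minimal illustration rather than the controller logic of any real device: the function name is hypothetical, and the default voltages and thresholds are the normalized values reported for the baseline model in Section 4.1.

```python
from bisect import bisect_right

# Normalized step voltages (largest first) and the PE cycling thresholds
# N_1..N_m reported in Section 4.1 for the baseline memory model.
STEP_VOLTAGES = [0.45, 0.40, 0.35, 0.30]
THRESHOLDS = [2710, 4820, 7500, 10000]


def select_step_voltage(pe_cycles, voltages=STEP_VOLTAGES, thresholds=THRESHOLDS):
    """Return the step voltage for a block that has seen `pe_cycles` cycles.

    The i-th voltage (1-based) is used while N_{i-1} <= pe_cycles < N_i.
    """
    if pe_cycles < 0:
        raise ValueError("PE cycle count must be non-negative")
    if pe_cycles >= thresholds[-1]:
        raise ValueError("block is past its PE cycling endurance limit")
    # bisect_right counts the thresholds that are <= pe_cycles, which is
    # the 0-based index of the interval [N_{i-1}, N_i) containing pe_cycles.
    return voltages[bisect_right(thresholds, pe_cycles)]


print(select_step_voltage(0))      # early lifetime: largest step voltage
print(select_step_voltage(9999))   # close to the limit: worst-case step voltage
```

A lookup of this kind only needs the per-block erase count, which wear-leveling already has to track.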


3.2 Approach II: Improve Memory Technology Scalability


In this subsection, we elaborate on the potential of trad- 
ing the under-utilized ECC redundancy to improve mem- 
ory defect tolerance. With the help of very sophisticated 
techniques such as double patterning [20], the decade- 
long 193nm photolithography has successfully pushed 
NAND flash memory into the sub-30nm region. How- 
ever, as the industry is striving to push the NAND flash 
memory technology scaling into the sub-20nm region 
by using immersion photolithography or new lithogra- 
phy technologies such as nanoimprint, defects in such 
extremely dense memory arrays may inevitably increase.
As a result, conventional spare row/column repair techniques
may become inadequate to ensure a sufficiently high yield.

Figure 4: Illustration of the impact of program step voltage $\Delta V_{pp}$ on the program speed vs. raw storage bit error rate trade-off.

Intuitively, the existing ECC redundancy can be
leveraged to tolerate memory defects, especially random
memory cell defects. However, if a certain portion of the ECC
redundancy is used for defect tolerance, it will not be
able to ensure the specified PE cycling limit, leading to
PE cycling endurance degradation. Since all the pages
in each memory block undergo the same number of PE
cycles, the worst-case page (i.e., the page that contains the
most defects) in each block sets the achievable PE cycling
endurance for this block. For example, assume the
existing ECC redundancy can tolerate up to 50 errors for
each page and survive up to 10K PE cycles in the absence
of any memory cell defects. If the worst-case page
in one block contains 5 defective cells, then it can only
use the residual 45-error-correcting capability to tolerate
memory operational noises such as PE cycling effects
and cell-to-cell interference. Suppose this means that
the worst-case page can only survive up to 8K PE cycles;
then this block can only be erased 8K times instead
of 10K times before risking data loss.


Clearly, if we attempt to reserve certain ECC re- 
dundancy for tolerating memory cell defects, we must 
minimize the impact on overall memory PE cycling 
endurance. In current design practice, NAND flash 
memory uses wear-leveling to uniformly spread pro- 
gram/erase operations among all the memory blocks to 
maximize the overall memory lifetime. Since memory
blocks with different numbers of defective memory cells
can survive different numbers of PE cycles, uniform
wear-leveling is clearly not an optimal option. Instead,
we should make wear-leveling fully aware of the
different achievable PE cycling limits among different
memory blocks, which is referred to as differential
wear-leveling. This is illustrated in Fig. 5: instead of
uniformly distributing program/erase operations among all
the memory blocks, differential wear-leveling schedules
the program/erase operations among all the memory
blocks in proportion to their achievable PE cycling limits.
As a result, we may largely improve the overall memory
lifetime compared with uniform wear-leveling.
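
One way to picture the difference between the two policies is a toy simulation. The sketch below is an assumption-laden illustration, not the scheduling algorithm of the paper: the function names, the "smallest consumed fraction" rule, and the tiny per-block limits are all hypothetical.

```python
def pick_block(erase_counts, limits):
    """Differential wear-leveling: pick the usable block that has consumed
    the smallest fraction of its own achievable PE cycling limit.

    Returns None when every block has reached its limit.
    """
    candidates = [b for b in range(len(limits)) if erase_counts[b] < limits[b]]
    if not candidates:
        return None
    return min(candidates, key=lambda b: erase_counts[b] / limits[b])


def uniform_policy(erase_counts, limits):
    """Uniform wear-leveling: spread erases evenly over all blocks and stop
    as soon as any block has reached its limit."""
    if any(c >= lim for c, lim in zip(erase_counts, limits)):
        return None
    return min(range(len(limits)), key=lambda b: erase_counts[b])


def simulate(limits, policy):
    """Count how many erase operations the policy can serve in total."""
    counts = [0] * len(limits)
    served = 0
    while True:
        block = policy(counts, limits)
        if block is None:
            return served
        counts[block] += 1
        served += 1


# Hypothetical per-block limits (kept tiny so the simulation is instant).
limits = [10, 8, 9]
print("uniform:", simulate(limits, uniform_policy))
print("differential:", simulate(limits, pick_block))
```

Under the differential policy every block is used up to its own limit, so the total number of erases served equals the sum of the limits, whereas the uniform policy stops once the weakest block is worn out.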


FAST '11: 9th USENIX Conference on File and Storage Technologies

[Figure 5 panels: (a) and (b) each plot remaining life against endurance for Block 0, Block 1, and Block 2 at early, middle, and end lifetime.]

Figure 5: Illustration of (a) conventional uniform wear- 
leveling, and (b) proposed differential wear-leveling, 
where the ECC is used to tolerate defective memory cells 
and hence different blocks may have different achievable 
PE cycling endurance. 


Assume the worst-case page can contain at most $M$
defective memory cells, and let $P_d$ denote the probability
that the worst-case page in one block contains $d \in [0, M]$
defective memory cells. Given the number of defective
memory cells in the worst-case page $d$, we can obtain
the corresponding achievable PE cycling endurance limit
$N^{(d)}$, i.e., the ECC can ensure a PE cycling number up to
$N^{(d)}$ while tolerating $d$ defective memory cells. Clearly,
we have $N^{(0)} > N^{(1)} > \cdots > N^{(M)}$, where $N^{(0)}$ is the
achievable PE cycling limit in the defect-free scenario.
Define the effective PE cycling endurance as the average
PE cycling limit of all the memory blocks. Under
uniform wear-leveling, the memory chip can only sustain
PE cycling of $N^{(M)}$. Therefore, compared with the
defect-free scenario, the effective PE cycling endurance
degrades by a factor of $N^{(0)}/N^{(M)}$, which can result in a
significant memory lifetime degradation. On the other hand,
under the ideal differential wear-leveling, each block can reach
its own PE cycling limit as illustrated in Fig. 5, hence
the effective PE cycling endurance will be
$\sum_{d=0}^{M} P_d \cdot N^{(d)}$, which is an improvement
over the uniform wear-leveling. We note that this design
approach can be combined with the one presented in
Section 3.1 to improve average memory program speed
in the presence of memory cell defects. Given the number
of defective memory cells $d$ and the set of $m$ program
step voltages $\Delta V_{pp}^{(i)}$ for $1 \le i \le m$, we can obtain a
set of PE cycling thresholds
$N_0^{(d)} = 0 < N_1^{(d)} < \cdots < N_m^{(d)}$,
i.e., if the present PE cycling number falls into the range of
$[N_{i-1}^{(d)}, N_i^{(d)})$, we can use the program step voltage
$\Delta V_{pp}^{(i)}$ and still ensure the tolerance to $d$ defective
memory cells. Therefore, for the blocks whose worst-case
page contains $d$ defective memory cells, the average
program speed improvement is

$$s_d = 1 - \frac{\sum_{i=1}^{m} \left(N_i^{(d)} - N_{i-1}^{(d)}\right) \cdot \frac{1}{\Delta V_{pp}^{(i)}}}{N_m^{(d)} \cdot \frac{1}{\Delta V_{pp}^{(m)}}} \quad (12)$$

The overall average program speed improvement can be
further estimated as $\sum_{d=0}^{M} P_d \cdot s_d$.
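
A short worked inequality may help to see why the differential scheme can never do worse than the uniform one. The derivation below is only a sketch, and it assumes that the probabilities $P_d$ are normalized over $d \in [0, M]$ (blocks with more than $M$ defects being excluded), which the text above does not state explicitly.

```latex
\begin{align*}
\sum_{d=0}^{M} P_d \, N^{(d)}
  &\ge \sum_{d=0}^{M} P_d \, N^{(M)}
     && \text{since } N^{(d)} \ge N^{(M)} \text{ for all } d \le M \\
  &= N^{(M)} \sum_{d=0}^{M} P_d
   = N^{(M)}
     && \text{assuming } \sum_{d=0}^{M} P_d = 1 .
\end{align*}
```

Equality holds only when every block has the worst-case defect count, so the gain grows as more blocks have fewer defects than $M$.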


4 Evaluation Results 


We carried out simulations and analysis to further demonstrate
the effectiveness of the above two simple design
approaches and their combination. To carry out
trace-based simulations, we use the SSD module [3] in 
DiskSim [8], and use 6 workload traces including Iozone 
and Postmark [3], Finance1 and Finance2 from [1], and
Trace1 and Trace2 from [16]. The simulator supports
several packages that work in parallel to improve
the SSD throughput. Each package
contains 2 dies that share an 8-bit I/O bus and a
number of common control signals, and each die con- 
tains 4 planes and each plane contains 2048 blocks. Each 
block contains 64 4KB pages, each of which consists of 
8 512B sectors. Following the version 2.1 of the Open 
NAND Flash Interface (ONFI) [2], we set the NAND 
flash chip interface bus frequency as 200MB/s. Re- 
garding the ECC, we assume that binary (n, k, t) BCH 
codes are being used, where n is the codeword length, 
k is the user data length (i.e., 512B in this study), and 
t is the error-correcting capability. We consider the use 
of 2bit/cell NAND flash memory, and set the baseline 
2bit/cell NAND flash memory using the equivalent mem- 
ory channel model parameters presented in Example 2.1 
in Section 2.4, for which a (4798, 4096, 54) BCH code 
can ensure a PE cycling endurance limit of 10K under the 
retention time of 1 year. We note that the target NAND
flash memory retention time is fixed as 1 year throughout
all the studies in this work.

In this section, we first present trace-based simulation 
results to demonstrate how the first design approach can 
reduce the overall request response time and hence im- 
prove SSD speed performance. Then, we present anal- 
ysis results to demonstrate the second design approach 
by assuming memory cell defects follow a Poisson distribution.
Finally, we demonstrate the effectiveness when
these two approaches are combined together to improve 
SSD speed performance in the presence of memory cell 
defects. 


4.1 Improve SSD Speed Performance 


In the baseline scenario with the parameters listed in
Example 2.1, the normalized program step voltage $\Delta V_{pp}$ is
0.3. As discussed in Section 3.1, we can use a
larger-than-worst-case $\Delta V_{pp}$ over the memory lifetime to improve
memory program speed by exploiting the memory device
wear-out dynamics. In this work, we assume that
memory chip voltage generators can increase $\Delta V_{pp}$ with a
step of 0.05, hence we consider four different normalized
values: $\Delta V_{pp}^{(1)} = 0.45$, $\Delta V_{pp}^{(2)} = 0.4$,
$\Delta V_{pp}^{(3)} = 0.35$, and $\Delta V_{pp}^{(4)} = 0.3$.
By carrying out Monte Carlo simulations
without changing the other memory model parameters,
we find that these four program step voltages
can survive up to $N_1 = 2710$, $N_2 = 4820$, $N_3 = 7500$, and
$N_4 = 10000$ PE cycles, respectively, under the retention
time of 1 year. Therefore, according to Eq. (10), the
average NAND flash memory program speed can be
improved by 18% compared with the baseline scenario. We
further carried out DiskSim-based simulations to investigate
how such improved memory program speed can
reduce the SSD average response time (incorporating both
write and read request response time) for different traces
under different system configurations. We set the
2bit/cell NAND flash memory program latency to 600 μs
when the normalized program step voltage $\Delta V_{pp}$ is 0.3,
the on-chip memory sensing latency to 30 μs, and the erase
time to 3 ms.
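
The 18% figure quoted above can be checked by plugging the four step voltages and thresholds into Eq. (10). The sketch below does only that arithmetic; the function name is hypothetical, and program latency is taken to be proportional to the inverse of the step voltage, as stated in Section 3.1.

```python
def program_speed_improvement(voltages, thresholds):
    """Evaluate Eq. (10) for step voltages (largest first) and thresholds N_1..N_m."""
    previous = 0
    weighted_latency = 0.0
    for step_voltage, threshold in zip(voltages, thresholds):
        # Cycles spent in [N_{i-1}, N_i) times the relative program latency.
        weighted_latency += (threshold - previous) / step_voltage
        previous = threshold
    # Baseline: the smallest step voltage is used for the whole lifetime.
    baseline_latency = thresholds[-1] / voltages[-1]
    return 1.0 - weighted_latency / baseline_latency


s = program_speed_improvement([0.45, 0.40, 0.35, 0.30], [2710, 4820, 7500, 10000])
print(f"average program speed improvement: {s:.1%}")  # about 18%
```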

In this study, we consider the use of 4 and 8 parallel
packages. Fig. 6 compares the normalized SSD average
response time when using 4 and 8 parallel packages,
respectively, where we set $\Delta V_{pp}$ as 0.3. It shows that using
more parallel packages can directly improve SSD speed
performance, which can be intuitively justified. Fig. 7(a)
and Fig. 7(b) show the normalized SSD average response
time under the 4 different normalized program step
voltages $\Delta V_{pp}$ for all the 6 traces when the SSD contains 4 and
8 parallel packages, respectively. We use the first-come
first-serve (FCFS) scheduling scheme in the simulations.
Compared with the baseline scenario with $\Delta V_{pp} = 0.3$,
the average response time can be reduced by up to ~50%
with 4 parallel packages and up to ~40% with 8 parallel
packages. The results show that the use of a larger program
step voltage can consistently improve SSD speed
performance under different numbers of parallel packages.

Figure 6: Comparison of normalized SSD average response
time with 4 and 8 parallel packages ($\Delta V_{pp} = 0.3$).

Figure 7: Simulated normalized average response time when the SSD contains (a) 4 parallel packages, and (b) 8 parallel packages.

Given the PE cycling thresholds $N_i$ for $i = 1, 2, 3, 4$ as
presented above, the NAND flash memory should
employ the program step voltage $\Delta V_{pp}^{(i)}$ when the present
PE cycling number falls into $[N_{i-1}, N_i)$, where $N_0$ is
set to 0. Therefore, based on the simulation results
shown in Fig. 7, we can obtain the overall SSD average
response time reduction compared with the baseline
scenario, as shown in Fig. 8. It shows that the proposed
design approach can noticeably improve the overall SSD
speed performance. Intuitively, those traces with higher
write request ratios (e.g., Iozone, Trace1, and Trace2)
tend to benefit more from this design approach, as shown
in Fig. 8. In addition, as we increase the package
parallelism from 4 to 8, the overall response time reduction
consistently shrinks across all the traces. This can be
explained as follows: as the SSD contains more parallel
packages, the increased architecture-level parallelism
directly improves SSD speed performance, as illustrated
in Fig. 6. As a result, the improvement in device-level
program speed becomes relatively less significant with
respect to the overall system speed performance.

Figure 8: Overall SSD average response time reduction
compared with the baseline scenario when using 4 and 8
parallel packages.


In the above simulations, the FCFS scheduling scheme 
has been used. To study the sensitivity of this design ap- 
proach to different scheduling schemes, we repeat the 
above simulations using two other popular scheduling 
schemes including ELEVATOR and SSTF (shortest seek 
time first) [43]. Fig. 9 shows the overall SSD average
response time reduction compared with the baseline
scenario, where the SSD contains 4 parallel packages. The
results show that the proposed design approach can
consistently improve overall SSD speed performance under
different scheduling schemes.

Figure 9: Overall SSD average response time reduction
compared with the baseline scenario under different
scheduling schemes.


4.2 Improve Defect Tolerance 


To demonstrate the proposed design approach for improving
memory defect tolerance, we assume that the
number of defective memory cells in each worst-case
page follows a Poisson distribution, which is widely used
to model defects in integrated circuits. Therefore, under
the Poisson-based distribution model, the probability
that the worst-case page in each block contains $k$ defective
memory cells is $f(k; \lambda) = \lambda^k e^{-\lambda} / k!$, where the
parameter $\lambda$ is the mean of the number of defective memory
cells in each worst-case page. Given the parameter $\lambda$,
we find the value $M$ so that $\sum_{i=0}^{M} f(i; \lambda) > 0.999$, and
assume that any blocks whose worst-case page contains
more than $M$ defective memory cells can be replaced by
a redundant block. In this work, we consider the mean $\lambda$
ranging from 1 to 4, and accordingly have that the
maximum value of $M$ is 12.
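
The coverage criterion above is easy to evaluate numerically. The following sketch uses hypothetical helper names and simply checks how much probability mass a cutoff of $M = 12$ leaves uncovered for each mean considered; it does not attempt to reproduce how the paper selected $M$ for each individual $\lambda$.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """f(k; lam) = lam**k * exp(-lam) / k!"""
    return lam ** k * exp(-lam) / factorial(k)


def coverage(m, lam):
    """Probability that the worst-case page has at most m defective cells."""
    return sum(poisson_pmf(i, lam) for i in range(m + 1))


for lam in (1, 2, 3, 4):
    tail = 1.0 - coverage(12, lam)
    print(f"lambda={lam}: fraction of blocks beyond M=12 is {tail:.2e}")
```

The uncovered tail is the fraction of blocks that would have to be replaced by redundant blocks under the stated assumption.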

Using the baseline NAND flash memory model parameters
as listed in Example 2.1, we can obtain the
achievable PE cycling limit $N^{(d)}$ for each $d$, i.e., we use
the (4798, 4096, 54) BCH code to tolerate $d$ defective
memory cells and meanwhile use its residual $(54 - d)$-error-correcting
capability to ensure a PE cycling endurance
limit of $N^{(d)}$ under the retention time of 1 year.
Fig. 10 shows the achievable PE cycling limit $N^{(d)}$ with
$d$ ranging from 0 to 12. Under different values of the mean
$\lambda$, we have different values of $M$, denoted as $M^{(\lambda)}$. When
the uniform wear-leveling is being used, the effective PE
cycling endurance is simply $N^{(d)}$ with $d = M^{(\lambda)}$. When
the proposed differential wear-leveling is being used, the
effective PE cycling endurance is

$$\sum_{k=0}^{M^{(\lambda)}} \frac{\lambda^k e^{-\lambda}}{k!} \cdot N^{(k)} \quad (13)$$

for a given mean $\lambda$. Fig. 11 shows the effective PE cycling
endurance when these two different wear-leveling
schemes are being used under different values of $\lambda$. The
results show that the proposed differential wear-leveling
can noticeably improve the effective PE cycling endurance
and hence SSD lifetime compared with uniform
wear-leveling. As the defect density increases (i.e., $\lambda$
increases), the gain of differential wear-leveling over uniform
wear-leveling will accordingly improve (i.e., from
about 10% improvement at $\lambda = 1$ to about 30% improvement
at $\lambda = 4$).

Figure 10: Achievable PE cycling endurance under different
numbers of defective memory cells in the worst-case page.

Figure 11: Effective PE cycling endurance when using
uniform wear-leveling and differential wear-leveling under
different values of $\lambda$.
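
To make the comparison between the two wear-leveling schemes concrete, the sketch below evaluates both effective endurance figures for one defect density. The per-defect limits are hypothetical placeholders (a linear drop from 10K cycles); the actual $N^{(d)}$ values come from Fig. 10 and are not reproduced here, so the printed numbers are illustrative only.

```python
from math import exp, factorial


def effective_endurance(lam, m, limits):
    """Return (uniform, differential) effective PE cycling endurance.

    limits[k] is the achievable PE cycling limit N^(k) when the worst-case
    page holds k defective cells; m is the largest defect count kept in use.
    """
    # Uniform wear-leveling: every block is retired at the weakest limit.
    uniform = limits[m]
    # Differential wear-leveling: Poisson-weighted sum of the limits, Eq. (13).
    differential = sum(
        lam ** k * exp(-lam) / factorial(k) * limits[k] for k in range(m + 1)
    )
    return uniform, differential


# Hypothetical limits: 10K cycles defect-free, losing 150 cycles per defect.
hypothetical_limits = [10000 - 150 * k for k in range(13)]
uniform, differential = effective_endurance(4.0, 12, hypothetical_limits)
print(f"uniform: {uniform}, differential: {differential:.0f}")
```

With real limits taken from device simulation, the same function would give the two curves plotted against the mean defect count.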


4.3 Combination of the Two Design Approaches


As discussed earlier, we can combine the two proposed
design approaches in order to improve SSD speed performance
when ECC is also used to tolerate defective memory
cells. Following the discussions in Section 4.2, we
assume that the number of defective memory cells in the
worst-case page has a Poisson distribution and consider
the cases when the mean $\lambda$ ranges from 1 to 4. Following
the discussions in Section 4.1, beyond the normalized
program step voltage $\Delta V_{pp}$ of 0.3 in the baseline
scenario, we consider three larger values of $\Delta V_{pp}$, including
0.35, 0.4, and 0.45. Denote $\Delta V_{pp}^{(1)} = 0.45$,
$\Delta V_{pp}^{(2)} = 0.4$, $\Delta V_{pp}^{(3)} = 0.35$, and
$\Delta V_{pp}^{(4)} = 0.3$. Given the memory cell
defect number $d$ and the (4798, 4096, 54) BCH code
being used, we can obtain a set of PE cycling thresholds
$N_0^{(d)} = 0 < N_1^{(d)} < \cdots < N_4^{(d)}$ so that, if the present PE
cycling number falls into the range of
$[N_{i-1}^{(d)}, N_i^{(d)})$, we can
use the program step voltage $\Delta V_{pp}^{(i)}$ and meanwhile
ensure the tolerance of $d$ defective memory cells. Fig. 12
shows the PE cycling thresholds when the defect number
increases from 0 to 12. The results can be intuitively
justified: as the defect number increases, the residual ECC
error-correcting capability degrades, and consequently
the larger program step voltages can only be used over
a smaller number of PE cycles.

Figure 12: PE cycling thresholds corresponding to different
numbers of defective cells in the worst-case page
of one block.

Given each program step voltage $\Delta V_{pp}^{(i)}$, we can obtain
the normalized SSD response time $\tau_i$ for each specific
trace, as shown in Fig. 7. Recall that, when the PE cycling
number falls into the range $[N_{i-1}^{(d)}, N_i^{(d)})$, we can use
the program step voltage $\Delta V_{pp}^{(i)}$, and the baseline scenario
fixes the program step voltage as $\Delta V_{pp}^{(4)} = 0.3$ throughout
the entire memory lifetime. Therefore, we can calculate
the overall SSD average response time reduction over the
baseline scenario for each trace as


$$1 - \frac{\sum_{i=1}^{m} \left(N_i^{(d)} - N_{i-1}^{(d)}\right) \cdot \tau_i}{N_m^{(d)} \cdot \tau_m} \quad (14)$$


and the results are shown in Fig. 13. The results suggest 
that we still can maintain a noticeable SSD speed perfor- 
mance improvement when ECC is also used to tolerate 
defective memory cells. 


5 Related Work 


NAND flash memory system design has attracted much
recent attention, where most work focused on improving
system speed performance and endurance. Dirik and
Jacob [16] studied the effect on SSD system speed per- 
formance by changing various SSD system parallelism 
and concurrency at different levels such as the numbers 
of planes on each channel and the number of channels, 
and compared various existing disk access scheduling al- 
gorithms. Agrawal et al. [3] analyzed the effect of page 
size, striping and interleaving policy on the memory system
performance, and proposed the concept of a gang as a
higher-level “superblock” to facilitate SSD system-level 
parallelism configurations. Min and Nam [32] developed 
several NAND flash memory performance enhancement 
techniques such as write request interleaving. Seong 
et al. [37] applied bus-level and chip-level interleaving 
to exploit the inherent parallelism in multiple flash mem- 
ory chips to improve the SSD speed performance. The 
authors of [11,13] applied adaptive bank scheduling poli- 
cies to achieve an even distribution of write request and 
load balance to improve system speed performance. 
Wear-leveling is used to improve NAND flash mem- 
ory endurance. Gal and Toledo [18] surveyed many 
patented and published wear-leveling algorithms and 
data structures for NAND flash memory. Ben-Aroya 
and Toledo [5] more quantitatively evaluated different 
wear-leveling algorithms, including both on-line and 
off-line algorithms. The combination of wear-leveling 
and garbage collection and the involved design trade- 
offs have been investigated by many researchers, e.g., 
see [12, 14, 22, 23, 44]. In current design practice, defect
tolerance has been mainly realized by bad block management,
which monitors blocks at run time and disables the future
use of blocks with defects. Traditional redundant repair
can also be used to compensate for certain memory defects,
e.g., see [19]. In addition, a NAND flash memory device
model was presented in [33], which nevertheless does
not take into account RTN and cell-to-cell interference,
and the model was used to show that time-dependent
trap recovery can be leveraged to improve
memory endurance.

Figure 13: Overall average response time reduction over the baseline scenario under different $\lambda$ when SSD contains (a) 4 parallel packages and (b) 8 parallel packages.

We note that most prior work on improving SSD system
speed performance and/or memory endurance is
carried out mainly from an architecture/system perspective
to combat flash memory device issues. To the best of 
our knowledge, this paper represents the first attempt to 
adaptively exploit flash memory device characteristics, 
in particular PE-cycling-dependent device wear-out dy- 
namics, at the system level to improve SSD system speed 
performance and NAND flash memory scalability. The 
proposed design approaches are completely orthogonal 
to prior architecture/system level techniques and can be 
readily combined together. 


6 Conclusion 


This paper investigates the potential of adaptively lever- 
aging NAND flash memory cell wear-out dynamics to 
improve memory system performance. As memory PE 
cycling increases, NAND flash memory cell storage 
noise margin and hence raw storage reliability accord- 
ingly degrade. Therefore, the specified PE cycling en- 
durance limit determines the worst-case raw memory 
storage reliability, which further sets the amount of re- 
dundant memory cells that must be fabricated. Motivated 
by the fact that such worst-case oriented redundancy is 
essentially under-utilized over the entire memory life- 
time, especially when the PE cycling number is relatively 
small, this paper proposes to trade such under-utilized re- 


dundancy to improve system speed performance and/or 
tolerate defective memory cells. We further propose a 
simple differential wear-leveling scheme to minimize the 
impact on PE cycling endurance if the redundancy is 
used to tolerate defective memory cells. To quantita- 
tively evaluate such adaptive NAND flash memory sys- 
tem design strategies, we first develop an approximate 
NAND flash memory device model that can capture the 
effects of PE cycling on memory cell storage reliabil- 
ity. To evaluate the effectiveness on improving memory 
system speed, we carry out extensive simulations over a 
variety of traces using the DiskSim-based SSD simula- 
tor under different system configurations, and the results 
show up to 32% SSD average response time reduction 
can be achieved. To evaluate the effectiveness on defect
tolerance, with a Poisson-based defect statistics model,
we show that this design strategy can tolerate relatively
high defect rates at small degradation of effective PE cycling
endurance. Finally, we show that these two aspects
can be combined together so that we could noticeably
reduce SSD average response time even in the presence
of high memory defect densities.


Abstract


Application launch performance is of great importance 
to system platform developers and vendors as it greatly 
affects the degree of users’ satisfaction. The single most 
effective way to improve application launch performance 
is to replace a hard disk drive (HDD) with a solid state 
drive (SSD), which has recently become affordable and 
popular. A natural question is then whether or not to 
replace the traditional HDD-aware application launchers 
with a new SSD-aware optimizer. 

We address this question by analyzing the inefficiency 
of the HDD-aware application launchers on SSDs and 
then proposing a new SSD-aware application prefetching 
scheme, called the Fast Application STarter (FAST). The 
key idea of FAST is to overlap the computation (CPU) 
time with the SSD access (I/O) time during an applica- 
tion launch. FAST is composed of a set of user-level 
components and system debugging tools provided by the 
Linux OS (operating system). In addition, FAST uses a 
system-call wrapper to automatically detect application 
launches. Hence, FAST can be easily deployed in any 
recent Linux versions without kernel recompilation. We 
implemented FAST on a desktop PC with a SSD running 
Linux 2.6.32 OS and evaluated it by launching a set of 
widely-used applications, demonstrating an average of 
28% reduction of application launch time as compared
to a PC without a prefetcher.


1 Introduction 


Application launch performance is one of the impor- 
tant metrics for the design or selection of a desktop or 
a laptop PC as it critically affects the user-perceived 
performance. Unfortunately, application launch perfor- 
mance has not kept up with the remarkable progress of 
CPU performance that has thus far evolved according to 
Moore’s law. As frequently-used or popular applications 
get “heavier” (by adding new functions) with each new 


release, their launch takes longer even if a new, power- 
ful machine equipped with high-speed multi-core CPUs 
and several GBs of main memory is used. This undesir- 
able trend is known to stem from the poor random access 
performance of hard disk drives (HDDs). When an ap- 
plication stored in a HDD is launched, up to thousands 
of block requests are sent to the HDD, and a significant 
portion of its launch time is spent on moving the disk 
head to proper track and sector positions, i.e., seek and 
rotational latencies. Unfortunately, the HDD seek and 
rotational latencies have not been improved much over 
the last few decades, especially compared to the CPU 
speed improvement. In spite of the various optimizations 
proposed to improve the HDD performance in launch- 
ing applications, users must often wait tens of seconds 
for the completion of launching frequently-used applica- 
tions, such as Windows Outlook. 

A quick and easy solution to eliminate the HDD’s seek 
and rotational latencies during an application launch is to 
replace the HDD with a solid state drive (SSD). A SSD 
consists of a number of NAND flash memory modules, 
and does not use any mechanical parts, unlike disk heads 
and arms of a conventional HDD. While the HDD ac- 
cess latency—which is the sum of seek and rotational 
latencies—ranges up to a few tens of milliseconds (ms), 
depending on the seek distance, the SSD shows a rather 
uniform access latency of about a few hundred microseconds
(μs). Replacing a HDD with a SSD is, therefore,
the single most effective way to improve application
launch performance.

Until recently, using SSDs as the secondary storage of 
desktops or laptops has not been an option for most users 
due to the high cost-per-bit of NAND flash memories. 
However, the rapid advance of semiconductor technol- 
ogy has continuously driven the SSD price down, and at 
the end of 2009, the price of an 80 GB SSD has fallen be- 
low 300 US dollars. Furthermore, SSDs can be installed 
in existing systems without additional hardware or software
support because they are usually equipped with the
same interface as HDDs, and OSes see a SSD as a block
device just like a HDD. Thus, end-users have begun to use a
SSD as their system disk to install the OS image and
applications.

Although a SSD can significantly reduce the applica- 
tion launch time, it does not give users ultimate satisfac- 
tion for all applications. For example, using a SSD re- 
duces the launch time of a heavy application from tens of 
seconds to several seconds. However, users will soon be- 
come used to the SSD launch performance, and will then 
want the launch time to be reduced further, just as they 
see from light applications. Furthermore, users will keep 
on adding functions to applications, making them heav- 
ier with each release and their launch time greater. Ac- 
cording to a recent report [24], the growth of software is 
rapid and limited only by the ability of hardware. These 
call for the need to further improve application launch 
performance on SSDs. 

Unfortunately, most previous optimizers for applica- 
tion launch performance are intended for HDDs and have 
not accounted for the SSD characteristics. Furthermore, 
some of them may rather be detrimental to SSDs. For ex- 
ample, running a disk defragmentation tool on a SSD is 
not beneficial at all because changing the physical loca- 
tion of data in the SSD does not affect its access latency. 
Rather, it generates unnecessary write and erase opera- 
tions, thus shortening the SSD’s lifetime. 

In view of these, the first step toward SSD-aware op- 
timization may be to simply disable the traditional op- 
timizers designed for HDDs. For example, Windows 7 
disables many functions, such as disk defragmentation, 
application prefetch, Superfetch, and Readyboost when 
it detects a SSD being used as a system disk [27]. Let’s 
consider another example. Linux is equipped with four 
disk I/O schedulers: NOOP, anticipatory, deadline, and 
completely fair queueing. The NOOP scheduler does almost
nothing to improve HDD access performance, thus
providing the worst performance on a HDD. Surprisingly,
it has been reported that NOOP shows better performance
than the other three sophisticated schedulers on
a SSD [11].

To the best of our knowledge, this is the first attempt 
to focus entirely on improving application launch perfor- 
mance on SSDs. Specifically, we propose a new appli- 
cation prefetching method, called the Fast Application 
STarter (FAST), to improve application launch time on 
SSDs. The key idea of FAST is to overlap the compu- 
tation (CPU) time with the SSD access (I/O) time dur- 
ing each application launch. To achieve this, we monitor 
the sequence of block requests in each application, and 
launch the application simultaneously with a prefetcher 
that generates I/O requests according to the a priori mon- 
itored application’s I/O request sequence. FAST consists 
of a set of user-level components, a system-call wrapper,
and system debugging tools provided by the Linux
OS. FAST can be easily deployed in most recent Linux
versions without kernel recompilation. We have implemented
and evaluated FAST on a desktop PC with a SSD
running Linux 2.6.32, demonstrating an average 28%
reduction of application launch time compared to a PC
without a prefetcher.

This paper makes the following contributions:

• Qualitative and quantitative evaluation of the inefficiency
  of traditional HDD-aware application launch
  optimizers on SSDs;

• Development of a new SSD-aware application
  prefetching scheme, called FAST; and

• Implementation and evaluation of FAST, demonstrating
  its superiority and deployability.


While FAST can also be applied to HDDs, its performance
improvement there is limited by the high I/O requirements
of application launches on HDDs. We observed
that existing application prefetchers outperformed
FAST on HDDs by effectively optimizing disk head
movements, which will be discussed further in Section 5.

The paper is organized as follows. In Section 2, we re- 
view other related efforts and discuss their performance 
in optimizing application launch on SSDs. Section 3 
describes the key idea of FAST and presents an upper 
bound for its performance. Section 4 details the imple- 
mentation of FAST on the Linux OS, while Section 5 
evaluates its performance using various real-world appli- 
cations. Section 6 discusses the applicability of FAST to 
smartphones and Section 7 compares FAST with tradi- 
tional I/O prefetching techniques. We conclude the paper 
with Section 8. 


2 Background 


2.1 Application Launch Optimization 


Application-level optimization. Application developers 
are usually advised to optimize their applications for fast 
startup. For example, they may be advised to postpone 
loading non-critical functions or libraries so as to make 
applications respond as fast as possible [2, 30]. They 
are also advised to reduce the number of symbol reloca- 
tions while loading libraries, and to use dynamic library 
loading. There have been numerous case studies—based 
on in-depth analyses and manual optimizations—of vari- 
ous target applications/platforms, such as Linux desktop 
suite platform [8], a digital TV [17], and a digital still 
camera [33]. However, such an approach requires
experts' manual optimization of each and every application.
Hence, it is economically infeasible for general-purpose
systems with many (dynamic) application programs.

Snapshot technique. A snapshot boot technique has
also been suggested for fast startup of embedded systems 
[19], which is different from the traditional hibernate 
shutdown function in that a snapshot of the main mem- 
ory after booting an OS is captured only once, and used 
repeatedly for every subsequent booting of the system. 
However, applying this approach for application launch 
is not practical for the following reasons. First, the page 
cache in main memory is shared by all applications, and 
separating only the portion of the cache content that is 
related to a certain application is not possible without 
extensive modification of the page cache. Furthermore, 
once an application is updated, its snapshot should be in- 
validated immediately, which incurs runtime overhead. 

Prediction-based prefetch. Modern desktops are
equipped with large (up to several GBs) main memory, 
and often have abundant free space available in the main 
memory. Prediction-based prefetching, such as Super- 
fetch [28] and Preload [12], loads an application’s code 
blocks in the free space even if the user does not ex- 
plicitly express his intent to execute that particular ap- 
plication. These techniques monitor and analyze the 
users’ access patterns to predict which applications will be
launched in the future. Consequently, the improvement of
launch performance depends strongly on prediction ac- 
curacy. 

Sorted prefetch. The Windows OS is equipped with 
an application prefetcher [36] that prefetches appli- 
cation code blocks in a sorted order of their logical 
block addresses (LBAs) to minimize disk head move- 
ments. A similar idea has also been implemented for 
Linux OS [15, 25]. We call these approaches sorted 
prefetch. It monitors HDD activities to maintain a list 
of blocks accessed during the launch of each application. 
Upon detection of an application launch, the application 
prefetcher immediately pauses its execution and begins 
to fetch the blocks in the list in an order sorted by their 
LBAs. The application launch is resumed after fetching 
all the blocks, and hence, no page miss occurs during the 
launch. 
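
The ordering step of the sorted prefetch can be sketched as follows. This is only an illustration of "fetch in ascending LBA order": the `block_req` struct and the `sort_for_prefetch()` name are hypothetical and are not taken from the Windows or Linux prefetchers cited above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record for one block recorded during a launch. */
struct block_req {
    unsigned long long lba; /* starting logical block address */
    unsigned size;          /* request size in sectors */
};

/* Comparator for qsort(): ascending LBA, written to avoid overflow. */
static int by_lba(const void *a, const void *b) {
    const struct block_req *x = a;
    const struct block_req *y = b;
    return (x->lba > y->lba) - (x->lba < y->lba);
}

/* Reorder the recorded block list so that it can be fetched with
 * monotonically increasing LBAs, which minimizes head movement on a HDD. */
void sort_for_prefetch(struct block_req *list, size_t n) {
    if (n == 0) {
        return;
    }
    assert(list != NULL);
    qsort(list, n, sizeof *list, by_lba);
}
```

On a HDD this ordering shortens seeks; on a SSD it changes only the order of requests, not their individual latencies.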

Application defragmentation. The block list informa- 
tion can also be used in a different way to further reduce 
the seek distance during an application launch. Modern 
OSes commonly support a HDD defragmentation tool 
that reorganizes the HDD layout so as to place each file in 
a contiguous disk space. In contrast, the defragmentation 
tool can relocate the blocks in the list of each application 
by their access order [36], which helps reduce the total 
HDD seek distance during the launch. 

Data pinning on flash caches. Recently, flash cache has 
been introduced to exploit the advantage of SSDs at a 
cost comparable to HDDs. A flash cache can be inte- 
grated into traditional HDDs, which is called a hybrid 
HDD [37]. Also, a PCI card-type flash cache is available
[26], which is connected to the motherboard of a desktop
or laptop PC. As neither seek nor rotational latency is
incurred while accessing data in the flash cache, we can 
accelerate application launch by storing the code blocks 
of frequently-used applications, which is called a pinned 
set. Due to the small capacity of flash cache, how to 
determine the optimal pinned set subject to the capacity 
constraint is a key to making performance improvement, 
and a few results of addressing this problem have been 
reported [16, 18, 22]. We expect that FAST can be in- 
tegrated with the flash cache for further improvement of 
performance, but leave it as part of our future work. 


2.2 SSD Performance Optimization 


SSDs have become affordable and begun to be deployed 
in desktop and laptop PCs, but their performance char- 
acteristics have not yet been understood well. So, re- 
searchers conducted in-depth analyses of their perfor- 
mance characteristics, and suggested ways to improve 
their runtime performance. Extensive experiments have 
been carried out to understand the performance dynam- 
ics of commercially-available SSDs under various work- 
loads, without knowledge of their internal implementa- 
tions [7]. Also, SSD design space has been explored 
and some guidelines to improve the SSD performance 
have been suggested [10]. A new write buffer manage- 
ment scheme has also been suggested to improve the ran- 
dom write performance of SSDs [20]. Traditional I/O 
schedulers optimized for HDDs have been revisited in 
order to evaluate their performance on SSDs, and then 
a new I/O scheduler optimized for SSDs has been pro- 
posed [11, 21]. 


2.3 Launch Optimization on SSDs


As discussed in Section 2.1, various approaches have 
been developed and deployed to improve the applica- 
tion launch performance on HDDs. On one hand, many 
of them are effective on SSDs as well, and orthogo- 
nal to FAST. For example, application-level optimiza- 
tion and prediction-based prefetch can be used together 
with FAST to further improve application launch perfor- 
mance. 

On the other hand, some of them exploit the HDD 
characteristics to reduce the seek and rotational delay 
during an application launch, such as the sorted prefetch 
and the application defragmentation. Such methods are 
ineffective for SSDs because the internal structure of a 
SSD is very different from that of a HDD. A SSD typi- 
cally consists of multiple NAND flash memory modules, 
and does not have any mechanical moving part. Hence,
unlike a HDD, the access latency of a SSD is irrelevant to
the LBA distance between the last and the current block
requests. Thus, prefetching the application code blocks
according to the sorted order of their LBAs or changing
their physical locations will not make any significant
performance improvement on SSDs. As the sorted prefetch
has the most similar structure to FAST, we will
quantitatively compare its performance with FAST in Section 5.

Figure 1: Various application launch scenarios (n = 4).


3 Application Prefetching on SSDs 


This section illustrates the main idea of FAST with exam- 
ples and derives a lower bound of the application launch 
time achievable with FAST. 


3.1 Cold and Warm Starts 


We focus on the performance improvement in case of
a cold start, or the first launch of an application upon
system bootup, representing the worst-case application
launch performance. Figure 1(a) shows an example cold
start scenario, where $s_i$ is the $i$-th block request
generated during the launch and $n$ the total number of block
requests. After $s_i$ is completed, the CPU proceeds with
the launch process until another page miss takes place.
Let $c_i$ denote this computation.

The opposite extreme is a warm start in which all the 
code blocks necessary for launch have been found in the 
page cache, and thus, no block request is generated, as 
shown in Figure 1(b). This occurs when the application 
is launched again shortly after its closure. The warm start 
represents an upper bound of the application launch
performance improvement achievable with optimization of
the secondary storage.

Let the time spent for $s_i$ and $c_i$ be denoted by $t(s_i)$ and
$t(c_i)$, respectively. Then, the computation (CPU) time,
$t_{cpu}$, is expressed as

$$t_{cpu} = \sum_{i=1}^{n} t(c_i), \tag{1}$$

and the SSD access (I/O) time, $t_{ssd}$, is expressed as

$$t_{ssd} = \sum_{i=1}^{n} t(s_i). \tag{2}$$


3.2 The Proposed Application Prefetcher 


The rationale behind FAST is that the I/O request se- 
quence generated during an application launch does not 
change over repeated launches of the application in case 
of cold-start. The key idea of FAST is to overlap the SSD 
access (I/O) time with the computation (CPU) time by 
running the application prefetcher concurrently with the 
application itself. The application prefetcher replays the 
I/O request sequence of the original application, which 
we call an application launch sequence. An application 
launch sequence $S$ can be expressed as $(s_1, \ldots, s_n)$.

Figure 1(c) illustrates how FAST works, where $t_{cpu} > t_{ssd}$
is assumed. At the beginning, the target application
and the prefetcher start simultaneously, and compete
with each other to send their first block request to the
SSD. However, the SSD always receives the same block
request $s_1$ regardless of which process gets the bus grant
first. After $s_1$ is fetched, the application can proceed with
its launch for the time $t(c_1)$, while the prefetcher keeps
issuing the subsequent block requests to the SSD. After
completing $c_1$, the application accesses the code block
corresponding to $s_2$, but no page miss occurs for $s_2$
because it has already been fetched by the prefetcher. It is
the same for the remaining block requests, and thus, the
resulting application launch time $t_{launch}$ becomes

$$t_{launch} = t(s_1) + t_{cpu}. \tag{3}$$

Figure 1(d) shows another possible scenario where
$t_{cpu} < t_{ssd}$. In this case, the prefetcher cannot complete fetching
$s_2$ before the application finishes computation $c_1$. However,
$s_2$ can be fetched $t(c_1)$ earlier than in the
cold start, and this improvement is accumulated for all
of the remaining block requests, resulting in $t_{launch}$:

$$t_{launch} = t_{ssd} + t(c_n). \tag{4}$$

Note that $n$ ranges up to a few thousands for typical
applications, and thus, $t(s_1) \ll t_{cpu}$ and $t(c_n) \ll t_{ssd}$.
Consequently, Eqs. (3) and (4) can be combined into a single
equation as:

$$t_{launch} \approx \max(t_{ssd}, t_{cpu}), \tag{5}$$

which represents a lower bound of the application launch
time achievable with FAST.

Figure 3: The proposed application prefetching.


However, FAST may not achieve application launch
performance close to Eq. (5) when there is a significant
variation of I/O intensiveness, especially if the beginning
of the launch process is more I/O intensive than the rest.
Figure 2 illustrates an extreme example of such a case,
where the first half of this example is SSD-bound and the
second half is CPU-bound. In this example, $t_{cpu}$ is equal
to $t_{ssd}$, and thus the expected launch time $t_{expected}$ is given
to be $t_{ssd} + t(c_8)$, according to Eq. (4). However, the
actual launch time $t_{actual}$ is much larger than $t_{expected}$. The
CPU usage in the first half of the launch time is kept quite
low despite the fact that there are lots of remaining CPU
computations (i.e., $c_5, \ldots, c_8$) due to the dependency
between $s_i$ and $c_i$. We will provide a detailed analysis for
this case using real applications in Section 5.
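
The timing model of this section can be checked numerically with a small sketch. It rests on two assumptions that are our reading of Figure 1 rather than statements from the text: the prefetcher issues $s_1, \ldots, s_n$ back to back, and $c_i$ can start only after both $c_{i-1}$ has finished and $s_i$ has been fetched. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of n durations; gives t_cpu or t_ssd as in Eqs. (1) and (2). */
double total_time(const double *t, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += t[i];
    }
    return sum;
}

/* Cold start: each s_i is followed by c_i with no overlap (assumed). */
double cold_launch_time(const double *ts, const double *tc, size_t n) {
    return total_time(ts, n) + total_time(tc, n);
}

/* Launch time with a concurrent prefetcher under the assumptions above. */
double fast_launch_time(const double *ts, const double *tc, size_t n) {
    assert(n == 0 || (ts != NULL && tc != NULL));
    double fetched = 0.0;  /* time at which s_i has been fetched */
    double cpu_done = 0.0; /* time at which c_i has finished */
    for (size_t i = 0; i < n; i++) {
        fetched += ts[i];
        double start = cpu_done > fetched ? cpu_done : fetched;
        cpu_done = start + tc[i];
    }
    return cpu_done;
}

/* Lower bound of Eq. (5): max(t_ssd, t_cpu). */
double launch_lower_bound(const double *ts, const double *tc, size_t n) {
    double t_ssd = total_time(ts, n);
    double t_cpu = total_time(tc, n);
    return t_ssd > t_cpu ? t_ssd : t_cpu;
}
```

With uniform durations the simulation reproduces Eq. (3) in the CPU-bound case and Eq. (4) in the SSD-bound case. With an SSD-bound first half and a CPU-bound second half, as in the Figure 2 scenario, the simulated time exceeds the value predicted by Eq. (4).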


4 Implementation 


We chose the Linux OS to demonstrate the feasibility and 
the superior performance of FAST. The implementation 
of FAST consists of a set of components: an application 
launch manager, a system-call profiler, a disk I/O pro- 
filer, an application launch sequence extractor, a LBA- 
to-inode reverse mapper, and an application prefetcher 
generator. Figure 3 shows how these components inter- 
act with each other. In what follows, we detail the imple- 
mentation of each of these components. 


4.1 Application Launch Sequence 
4.1.1 Disk I/O Profiler 


The disk I/O profiler is used to track the block re- 
quests generated during an application launch. We used 
Blktrace [3], a built-in Linux kernel I/O-tracing tool 
that monitors the details of I/O behavior for the evalua- 
tion of I/O performance. Blktrace can profile various 
I/O events: inserting an item into the block layer, merg- 
ing the item with a previous request in the queue, remap- 
ping onto another device, issuing a request to the device 
driver, and a completion signal from the device. From 
these events, we collect the trace of device-completion 
events, each of which consists of a device number, a 
LBA, the I/O size, and completion time. 
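
A minimal sketch of how device-completion events might be picked out of a textual trace is shown below. It assumes each line follows the default layout printed by blkparse (device as major,minor, CPU, sequence number, timestamp, PID, action, RWBS flags, then "sector + count"), with action "C" marking a completion. The struct and function names are hypothetical, and the paper does not describe its own parsing code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one completion event. */
struct completion_event {
    unsigned dev_major;
    unsigned dev_minor;
    unsigned long long lba; /* starting sector */
    unsigned nsectors;      /* I/O size in sectors */
    double time;            /* completion timestamp in seconds */
};

/* Parse one trace line. Returns 1 and fills *out for a completion ("C")
 * event; returns 0 for any other event or an unparsable line. */
int parse_completion(const char *line, struct completion_event *out) {
    unsigned maj, min, cpu, nsect;
    unsigned long seq;
    unsigned long long sector;
    double ts;
    int pid;
    char action[8], rwbs[8];

    assert(line != NULL && out != NULL);
    int got = sscanf(line, "%u,%u %u %lu %lf %d %7s %7s %llu + %u",
                     &maj, &min, &cpu, &seq, &ts, &pid,
                     action, rwbs, &sector, &nsect);
    if (got != 10 || strcmp(action, "C") != 0) {
        return 0;
    }
    out->dev_major = maj;
    out->dev_minor = min;
    out->lba = sector;
    out->nsectors = nsect;
    out->time = ts;
    return 1;
}
```

Lines for other actions (queue insertion, merge, remap, issue) are rejected, so that only the device number, LBA, size, and completion time of completed requests are kept.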


4.1.2 Application Launch Sequence Extractor 


Ideally, the application launch sequence should include 
all of the block requests that are generated every time the 
application is launched in the cold start scenario, with- 
out including any block requests that are not relevant to 
the application launch. We observed that the raw block
request sequence captured by Blktrace does not vary
from one launch to another, i.e., it is deterministic across
multiple launches of the same application. However, we
also observed that other processes (e.g., OS and application
daemons) sometimes generate their own I/O requests
simultaneously with the application launch. To handle this
case, the application launch sequence extractor collects
two or more raw block request sequences to extract a
common sequence, which is then used as the launch
sequence of the corresponding application. The implementation
of the application launch sequence extractor
is simple: it searches for and removes any block requests
that appear in only some of the input sequences. This
procedure makes all the input sequences the same, so we
can use any of them as the application launch sequence.
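
The extraction rule described above can be sketched as follows. This is an illustrative quadratic-time version, not the authors' implementation: it keeps, in their original order, the requests of the first raw sequence that also occur in every other raw sequence. The `block_req` struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block request: starting LBA and size. */
struct block_req {
    unsigned long long lba;
    unsigned size;
};

/* Returns 1 if request r occurs anywhere in seq[0..n-1]. */
static int contains(const struct block_req *seq, size_t n,
                    struct block_req r) {
    for (size_t i = 0; i < n; i++) {
        if (seq[i].lba == r.lba && seq[i].size == r.size) {
            return 1;
        }
    }
    return 0;
}

/* Writes to out[] the requests of seqs[0] that appear in all nseq raw
 * sequences, preserving the order of seqs[0]. out must have room for
 * lens[0] entries. Returns the number of requests written. */
size_t extract_common(const struct block_req *const *seqs,
                      const size_t *lens, size_t nseq,
                      struct block_req *out) {
    assert(nseq >= 1 && seqs != NULL && lens != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = 0; i < lens[0]; i++) {
        int in_all = 1;
        for (size_t k = 1; k < nseq && in_all; k++) {
            in_all = contains(seqs[k], lens[k], seqs[0][i]);
        }
        if (in_all) {
            out[count++] = seqs[0][i];
        }
    }
    return count;
}
```

Requests issued by unrelated daemons during a single profiling run appear in only one raw sequence and are therefore dropped.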


4.2 LBA-to-Inode Map 
4.2.1 LBA-to-Inode Reverse Mapper 


Our goal is to create an application prefetcher that gen- 
erates exactly the same block request sequence as the 
obtained application launch sequence, where each block 
request is represented as a tuple of starting LBA and 
size. Since the application prefetcher is implemented as 
a user-level program, every disk access should be made 
via system calls with a file name and an offset in that file. 
Hence, we must obtain the file name and the offset of 
each block request in an application launch sequence. 
Most file systems, including EXT3, do not support
such a reverse mapping from LBA to file name and offset.
However, for a given file name, we can easily find
the LBAs of all of the blocks that belong to the file and
their relative offsets in the file. Hence, we can build a
LBA-to-inode map by gathering this information for ev- 
ery file. However, building such a map of the entire file 
system is time-consuming and impractical because a file 
system, in general, contains tens of thousands of files and 
their block locations on the disk change very often. 

Therefore, we build a separate LBA-to-inode map 
for each application, which can significantly reduce the 
overhead of creating a LBA-to-inode map because (1) 
the number of applications and the number of files used 
in launching each application are very small compared 
to the number of files in the entire file system; and (2) 
most of them are shared libraries and application code 
blocks, so their block locations remain unchanged unless 
they are updated or disk defragmentation is performed. 

We implement the LBA-to-inode reverse mapper that 
receives a list of file names as input and creates a LBA- 
to-inode map as output. A LBA-to-inode map is built 
using a red-black tree in order to reduce the search time. 
Each node in the red-black tree has the LBA of a block as 
its key, and a block type as its data by default. According 
to the block type, different types of data are added to 
the node. A block type includes a super block, a group 
descriptor, an inode block bitmap, a data block bitmap, 
an inode table, and a data block. For example, a node for 
a data block has a block type, a device number, an inode 
number, an offset, and a size. Also, for a data block, a 
table is created to keep the mapping information between 
an inode number and its file name. 
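
The reverse lookup for data blocks can be illustrated with the sketch below. Note the substitution: the paper uses a red-black tree keyed by LBA, whereas this sketch uses a sorted array with binary search, which gives the same logarithmic lookup for a map that is built once. The `extent` layout, the 512-byte sector size, and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_BYTES 512ULL /* assumed sector size */

/* Hypothetical map entry: a run of consecutive sectors of one file. */
struct extent {
    unsigned long long first_lba;   /* first sector of the run */
    unsigned long long nsectors;    /* length of the run in sectors */
    unsigned long inode;            /* inode number owning the run */
    unsigned long long file_offset; /* byte offset of the run in the file */
};

/* Binary search over a map sorted by first_lba with non-overlapping
 * runs. Returns the entry containing lba, or NULL if none does. */
const struct extent *map_lookup(const struct extent *map, size_t n,
                                unsigned long long lba) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (lba < map[mid].first_lba) {
            hi = mid;
        } else if (lba >= map[mid].first_lba + map[mid].nsectors) {
            lo = mid + 1;
        } else {
            return &map[mid];
        }
    }
    return NULL;
}

/* Translates an LBA to a byte offset within its file; -1 if unmapped. */
long long lba_to_file_offset(const struct extent *map, size_t n,
                             unsigned long long lba) {
    const struct extent *e = map_lookup(map, n, lba);
    if (e == NULL) {
        return -1;
    }
    return (long long)(e->file_offset + (lba - e->first_lba) * SECTOR_BYTES);
}
```

A separate inode-to-file-name table, as described above, would then turn the returned inode number into the path passed to the replayed system call.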


4.2.2 System-Call Profiler 


The system-call profiler obtains a full list of file names 
that are accessed during an application launch,¹ and
passes it to the LBA-to-inode reverse mapper. We used 
strace for the system-call profiler, which is a debugging 
tool in Linux. We can specify the argument of strace 
so that it may monitor only the system calls that have a 
file name as their argument. As many of these system 
calls are rarely called during an application launch, we 
monitor only the following system calls that frequently 
occur during application launches: open(), creat(), 
execve(), stat(), stat64(), lstat(), lstat64(), 
access(), truncate(), truncate64(), statfs(), 
statfs64(), readlink(), and unlink(). 


4.3 Application Prefetcher 
4.3.1 Application Prefetcher Generator 


The application prefetcher is a user-level program that
replays the disk access requests made by a target application.
We implemented the application prefetcher generator
to automatically create an application prefetcher
for each target application. It performs the following
operations.

1. Read $s_i$ one-by-one from $S$ of the target application.

2. Convert $s_i$ into its associated data items stored in the
   LBA-to-inode map, e.g.,
   (dev, LBA, size) → (datablk, filename, offset, size) or
   (dev, LBA, size) → (inode, start_inode, end_inode).

3. Depending on the type of block, generate an appropriate
   system call using the converted disk access
   information.

4. Repeat Steps 1-3 until all $s_i$ have been processed.

¹Files mounted on pseudo file systems such as procfs and sysfs
are not processed because they never generate any disk I/O request.

FAST ’11: 9th USENIX Conference on File and Storage Technologies

Table 1: System calls to replay access of blocks in an
application launch sequence

Block type                          System call
Inode table                         open()
Data block: a directory             opendir() and readdir()
Data block: a regular file          read() or posix_fadvise()
Data block: a symbolic link file    readlink()


Table 1 shows the kind of system calls used for each 
block type. There are two system calls that can be 
used to replay the disk access for data blocks of a reg- 
ular file. If we use read(), data is first moved from 
the SSD to the page cache, and then copying takes 
place from the page cache to the user buffer. The sec- 
ond step is unnecessary for our purpose, as the process 
that actually manipulates the data is not the application 
prefetcher but the target application. Hence, we chose 
posix_fadvise() that performs only the first step, from 
which we can avoid the overhead of read(). We use 
the POSIX_FADV_WILLNEED parameter, which informs 
the OS that the specified data will be used in the near 
future. When to issue the corresponding disk access af- 
ter posix_fadvise() is called depends on the OS im- 
plementation. We confirmed that the current version of 
Linux we used issues a block request immediately after 
receiving the information through posix_fadvise(), 
thus meeting our need.

A symbolic-linked file name is stored in the data block
pointers in an inode entry when the length of the file name
is less than or equal to 60 bytes (the space for data block
pointers is 60 bytes: 4*12 for direct, 4 for single indirect,
another 4 for double indirect, and the last 4 for triple
indirect data block pointers). If the length of the linked
file name is more than 60 bytes, the name is stored in the
data blocks pointed to by the data block pointers in the
inode entry. We use readlink() to replay the data block
access of symbolic-link file names that are longer than
60 bytes.

int main(void) {
    /* ... */
    readlink("/etc/fonts/conf.d/90-ttf-arphic-uming-embolden.conf", linkbuf, 256);
    int fd423;
    fd423 = open("/etc/fonts/conf.d/90-ttf-arphic-uming-embolden.conf", O_RDONLY);
    posix_fadvise(fd423, 0, 4096, POSIX_FADV_WILLNEED);
    posix_fadvise(fd351, 286720, 114688, POSIX_FADV_WILLNEED);
    int fd424;
    fd424 = open("/usr/share/fontconfig/conf.avail/90-ttf-arphic-uming-embolden.conf", O_RDONLY);
    posix_fadvise(fd424, 0, 4096, POSIX_FADV_WILLNEED);
    int fd425;
    fd425 = open("/root/.gnupg/trustdb.gpg", O_RDONLY);
    posix_fadvise(fd425, 0, 4096, POSIX_FADV_WILLNEED);
    dirp = opendir("/var/cache/");
    if (dirp) while (readdir(dirp));
    /* ... */
    return 0;
}

Figure 4: An example application prefetcher.


Figure 4 is an example of an automatically generated
application prefetcher. Unlike the target application, the
application prefetcher successively fetches all the blocks 
as soon as possible to minimize the time between adja- 
cent block requests. 
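
The code-emission step of the generator can be sketched as below: each converted item is turned into one line of C source in the style of Figure 4. The `replay_item` struct, the `blk_type` enumeration, and `emit_call()` are hypothetical names. The inode-table case is assumed to map to open(), following the open() calls that appear in Figure 4, and the identifiers `dirp` and `linkbuf` are taken from that figure.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical block categories that the generator distinguishes. */
enum blk_type { BLK_INODE, BLK_DIR, BLK_FILE, BLK_SYMLINK };

/* Hypothetical converted item: what to replay and on which file. */
struct replay_item {
    enum blk_type type;
    const char *path; /* file name from the LBA-to-inode map */
    int fd_id;        /* numeric suffix of the fd variable */
    long long offset; /* byte offset (regular files only) */
    long long size;   /* byte count (regular files only) */
};

/* Writes one line of generated C source into buf. Returns the value of
 * snprintf(), or -1 for an unknown block type. */
int emit_call(const struct replay_item *it, char *buf, size_t len) {
    assert(it != NULL && buf != NULL);
    switch (it->type) {
    case BLK_INODE:
        return snprintf(buf, len, "fd%d = open(\"%s\", O_RDONLY);",
                        it->fd_id, it->path);
    case BLK_DIR:
        return snprintf(buf, len,
                        "dirp = opendir(\"%s\"); "
                        "if (dirp) while (readdir(dirp));", it->path);
    case BLK_FILE:
        return snprintf(buf, len,
                        "posix_fadvise(fd%d, %lld, %lld, "
                        "POSIX_FADV_WILLNEED);",
                        it->fd_id, it->offset, it->size);
    case BLK_SYMLINK:
        return snprintf(buf, len, "readlink(\"%s\", linkbuf, 256);",
                        it->path);
    }
    return -1;
}
```

Emitting source text rather than issuing the calls directly mirrors the description of the generator as a tool that creates a separate prefetcher program per application.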


4.3.2 Implicitly-Prefetched Blocks 


In the EXT3 file system, the inode of a file includes 
pointers of up to 12 data blocks, so these blocks can 
be found immediately after accessing the inode. If the 
file size exceeds 12 blocks, indirect, double indirect, and 
triple indirect pointer blocks are used to store the point- 
ers to the data blocks. Therefore, requests for indirect 
pointer blocks may occur in the cold start scenario when 
the application is accessing files larger than 12 blocks. 
We cannot explicitly load those indirect pointer blocks in 
the application prefetcher because there is no such sys- 
tem call. However, the posix_fadvise() call for a data 
block will first make a request for the indirect block when 
needed, so it can be fetched in a timely manner by run- 
ning the application prefetcher. 

The following types of block request are not listed in 
Table 1: a superblock, a group descriptor, an inode entry
bitmap, and a data block bitmap. We found that requests to
these types of blocks seldom occur during an application 
launch, so we did not consider their prefetching. 


4.4 Application Launch Manager 


The role of the application launch manager is to detect 
the launch of an application and to take an appropri- 
ate action. We can detect the beginning of an applica- 
tion launch by monitoring execve() system call, which 
is implemented using a system-call wrapper. There are
three phases with which the application launch manager
deals: a launch profiling phase, a prefetcher generation
phase, and an application prefetch phase. The application
launch manager uses a set of variables and parameters
for each application to decide when to change its
phase. These are summarized in Table 2.

Table 2: Variables and parameters used by the application
launch manager

Symbol         Description
$n_{init}$     A counter to record the number of application
               launches done in the initial launch phase
$n_{pref}$     A counter to record the number of launches
               done in the application prefetch phase after the
               last check of the miss ratio of the application
               prefetcher
$N_{rawseq}$   The number of raw block request sequences that
               are to be captured at the launch profiling phase
$N_{chk}$      The period to check the miss ratio of the
               application prefetcher
$R_{miss}$     A threshold value for the prefetcher miss ratio that
               is used to determine if an update of the application
               or shared libraries has taken place
$T_{idle}$     A threshold value for the idle time period that is
               used to determine if an application launch is
               completed
$T_{timeout}$  The maximum amount of time allowed for the
               disk I/O profiler to capture block requests

Here we describe the operations performed in each 
phase: 
(1) Launch profiling. If no application prefetcher is 
found for that application, the application launch man- 
ager regards the current launch as the first launch of this 
application, and enters the initial launch phase. In this 
phase, the application launch manager performs the fol- 
lowing operations in addition to the launch of the target 
application: 


1. Increase $n_{init}$ of the current application by 1.

2. If $n_{init} = 1$, run the system-call profiler.

3. Flush the page cache, dentries (directory entries),
   and inodes in the main memory to ensure a cold start
   scenario, which is done by the following command:

   echo 3 > /proc/sys/vm/drop_caches

4. Run the disk I/O profiler. Terminate the disk I/O
   profiler when either of the following conditions is
   met: (1) no block request occurs during the last
   $T_{idle}$ seconds, or (2) the elapsed time since the start
   of the disk I/O profiler exceeds $T_{timeout}$ seconds.

5. If $n_{init} = N_{rawseq}$, enter the prefetcher generation
   phase after the current launch is completed.


(2) Prefetcher generation. Once application launch
profiling is done, the system is ready to generate an application
prefetcher using the information obtained from the first
phase. This can be performed either immediately after
the application launch is completed, or when the system
is idle. The following operations are performed:

1. Run the application launch sequence extractor.

2. Run the LBA-to-inode reverse mapper.

3. Run the application prefetcher generator.

4. Reset the values of $n_{init}$ and $n_{pref}$ to 0.


(3) Application prefetch. If the application prefetcher 
for the current application is found, the application 
launch manager runs the prefetcher simultaneously with 
the target application. It also periodically checks the miss 
ratio of the prefetcher to determine if there has been any 
update of the application or shared libraries. Specifically, 
the following operations are performed: 


1. Increase $n_{pref}$ of the current application by 1.

2. If $n_{pref} = N_{chk}$, reset the value of $n_{pref}$ to 0 and run
   the disk I/O profiler. Its termination conditions are
   the same as those in the first phase.

3. Run the application prefetcher simultaneously with
   the target application.

4. If a raw block request sequence is captured, use it to
   calculate the miss ratio of the application prefetcher.
   If it exceeds $R_{miss}$, delete the application prefetcher.


The miss ratio is defined as the ratio of the number of 
block requests not issued by the prefetcher to the total 
number of block requests in the application launch se- 
quence. 
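
The miss-ratio check can be sketched as follows. The definition above leaves some room for interpretation; this sketch assumes that the denominator is the length of the newly captured raw sequence and that a "miss" is a captured request that the prefetcher's stored sequence does not contain. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block request: starting LBA and size. */
struct block_req {
    unsigned long long lba;
    unsigned size;
};

/* Fraction of captured requests that the prefetcher did not issue.
 * Returns 0.0 for an empty capture. */
double miss_ratio(const struct block_req *captured, size_t ncap,
                  const struct block_req *prefetched, size_t npre) {
    if (ncap == 0) {
        return 0.0;
    }
    assert(captured != NULL && (prefetched != NULL || npre == 0));
    size_t misses = 0;
    for (size_t i = 0; i < ncap; i++) {
        int found = 0;
        for (size_t j = 0; j < npre && !found; j++) {
            found = captured[i].lba == prefetched[j].lba &&
                    captured[i].size == prefetched[j].size;
        }
        if (!found) {
            misses++;
        }
    }
    return (double)misses / (double)ncap;
}

/* Step 4 of the application prefetch phase: the prefetcher is deleted
 * (and later regenerated) when the miss ratio exceeds R_miss. */
int prefetcher_is_stale(double ratio, double r_miss) {
    return ratio > r_miss;
}
```

Deleting the prefetcher sends the next launch back to the launch profiling phase, since no prefetcher is then found for the application.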


5 Performance Evaluation 


5.1 Experimental Setup 


Experimental platform. We used a desktop PC
equipped with an Intel i7-860 2.8 GHz CPU, 4 GB of
PC12800 DDR3 SDRAM and an Intel 80 GB SSD (X25-M
G2 Mainstream). We installed Fedora 12 with Linux
kernel 2.6.32 on the desktop, in which we set NOOP
as the default I/O scheduler. For benchmark applications,
we chose frequently used user-interactive applications,
for which application launch performance matters
most. Such an application typically uses graphical
user interfaces and requires user interaction immediately
after completing its launch. Applications like gcc and
gzip are not included in our set of benchmarks as launch
performance is not an issue for them. Our benchmark
set consists of the following Linux applications: Acrobat
reader, Designer-qt4, Eclipse, F-Spot, Firefox, Gimp,
Gnome, Houdini, Kdevdesigner, Kdevelop, Konqueror,
Labview, Matlab, OpenOffice, Skype, Thunderbird, and
XilinxISE. In addition to these, we used Wine [1], which
is an implementation of the Windows API running on the
Linux OS, to test Access, Excel, Powerpoint, Visio, and
Word, which are typical Windows applications.

Test scenarios. For each benchmark application, we 
measured its launch time for the following scenarios. 


• Cold start: The application is launched immediately
  after flushing the page cache, using the method
  described in Section 4.4. The resulting launch time is
  denoted by t_cold.

• Warm start: We first run the application prefetcher
  only to load all the blocks in the application launch
  sequence to the page cache, and then launch the
  application. Let t_warm denote the resulting launch
  time.

• Sorted prefetch: To evaluate the performance of the
  sorted prefetch [15, 25, 36] on SSDs, we modify the
  application prefetcher to fetch the block requests in
  the application launch sequence in the sorted order
  of their LBAs. After flushing the page cache, we
  first run the modified application prefetcher, then
  immediately run the application. Let t_sorted denote
  the resulting launch time.

• FAST: We flush the page cache, and then run
  the application simultaneously with the application
  prefetcher. The resulting launch time is denoted by
  t_FAST.

• Prefetcher only: We flush the page cache and run
  the application prefetcher. The completion time
  of the application prefetcher is denoted by t_ssd. It
  is used to calculate a lower bound of the application
  launch time, t_bound = max(t_ssd, t_cpu), where
  t_cpu = t_warm is assumed.
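The lower bound defined in the last scenario can be illustrated with a short Python sketch. The function names and the timing values below are hypothetical and are not taken from our measurements; the sketch only restates the definition of the bound and of a launch-time reduction relative to the cold start.

```python
def launch_time_lower_bound(t_ssd: float, t_cpu: float) -> float:
    """Lower bound on launch time if CPU work and SSD reads overlap perfectly."""
    return max(t_ssd, t_cpu)


def reduction_over_cold(t_cold: float, t_other: float) -> float:
    """Fractional launch-time reduction relative to the cold start."""
    return (t_cold - t_other) / t_cold


# Hypothetical timings in seconds (illustrative only).
t_cold, t_warm, t_ssd = 2.0, 1.2, 0.9

# t_cpu = t_warm is assumed, as in the scenario description.
t_bound = launch_time_lower_bound(t_ssd, t_cpu=t_warm)
best_case = reduction_over_cold(t_cold, t_bound)
```

With these made-up numbers the launch is CPU-bound, so the bound equals the warm start time and no prefetching scheme could reduce the launch time by more than the resulting fraction.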


Launch-time measurement. We start an application 
launch by clicking an icon or inputting a command, and 
can accurately measure the launch start time by moni- 
toring when execve() is called. Although it is difficult 
to clearly define the completion of a launch, a reasonable 
definition is the first moment the application becomes re- 
sponsive to the user [2]. However, it is difficult to accu- 
rately and automatically measure that moment. So, as 
an alternative, we measured the completion time of the 
last block request in an application launch sequence us- 
ing Blktrace, assuming that the launch will be com- 
pleted very soon after issuing the last block request. For 
the warm start scenario, we executed posix_fadvise() 
with the POSIX_FADV_DONTNEED parameter to evict the last
block request from the page cache. For the sorted 
prefetch and the FAST scenarios, we modified the ap- 
plication prefetcher so that it skips prefetching of the last 
block request. 
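The eviction step used for the warm start scenario can be sketched with Python's wrapper around posix_fadvise(). This is only an illustration under stated assumptions: the helper name evict_range is hypothetical, the last block request is assumed to have already been mapped to a (file, offset, length) triple, and the demonstration uses a throwaway temporary file rather than an application file. The advice is a hint to the kernel, and dirty pages are not guaranteed to be dropped.

```python
import os
import tempfile


def evict_range(path: str, offset: int, length: int) -> None:
    """Ask the kernel to drop cached pages for one byte range of a file."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # POSIX_FADV_DONTNEED: the range is not expected to be accessed soon.
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


# Demonstration on a temporary 8 KB file (hypothetical stand-in for the
# file that backs the last block request of a launch sequence).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 8192)
    demo_path = tmp.name

evict_range(demo_path, offset=4096, length=4096)
os.remove(demo_path)
evicted_ok = True
```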
Figure 5: The size of application launch sequences.


5.2 Experimental Results 


Application launch sequence generation. We captured 
10 raw block request sequences during the cold start 
launch of each application. We ran the application launch
sequence extractor with varying numbers of input block
request sequences, and observed the size of the resulting
application launch sequences. Figure 5 shows that
for all the applications we tested, there is no significant
reduction of the application launch sequence size when
increasing the number of inputs from 2 to 10. Hence, we
set the value of N_rawseq in Table 2 to 2 in this paper. For
the data point at one input in Figure 5, we used the size of
the first captured input sequence (the application launch
sequence extractor requires at least two input sequences).
For some applications, there are noticeable differences in
size between one and two inputs. This is
because the first raw input request sequence includes a 
set of bursty I/O requests generated by OS and user dae- 
mons that are irrelevant to the application launch. Fig- 
ure 5 shows that such I/O requests can be effectively 
excluded from the resulting application launch sequence 
using just two input request sequences. 
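To make the filtering effect concrete, the following Python sketch keeps only the block requests that occur in every raw input sequence, preserving the order of the first one. This is an assumption made for illustration, not a description of the actual extractor; the (start LBA, size) encoding and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding of a block request: (start LBA, size in sectors).
BlockRequest = Tuple[int, int]


def extract_launch_sequence(
    raw_sequences: Sequence[Sequence[BlockRequest]],
) -> List[BlockRequest]:
    """Keep requests common to all raw sequences, in first-sequence order."""
    if len(raw_sequences) < 2:
        raise ValueError("at least two raw input sequences are required")
    common = set(raw_sequences[0]).intersection(*map(set, raw_sequences[1:]))
    seen = set()
    result = []
    for request in raw_sequences[0]:
        if request in common and request not in seen:
            seen.add(request)
            result.append(request)
    return result


# Two made-up captures; (900, 16) and (555, 8) stand for unrelated daemon I/O.
run1 = [(100, 8), (900, 16), (200, 8), (300, 32)]
run2 = [(100, 8), (200, 8), (555, 8), (300, 32)]
launch_seq = extract_launch_sequence([run1, run2])
```

In this toy input, each capture contains one request that the other lacks, and both are dropped once the second capture is taken into account.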

The second and third columns of Table 3 summarize
the total number of block requests and accessed blocks of
the thus-obtained application launch sequences, respectively.
The last column shows the total number of files
used during the launch of each application.

Testing of the application prefetcher. Application
prefetchers are automatically generated for the bench- 
mark applications using the application launch sequences 
in Table 3. In order to see if the application prefetch- 
ers fetch all the blocks used by an application, we 
first flushed the page cache, and launched each applica- 
tion immediately after running the application prefetcher. 
During the application launch, we captured all the block 
requests generated using Blktrace, and counted the 
number of missed block requests. The average number of 
missed block requests was 1.6% of the number of block 
requests in the application launch sequence, but varied 
among repeated launches, e.g., from 0% to 6.1% in the 
experiments we performed. 
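The miss ratio reported above is the number of block requests that still reach the SSD after prefetching, divided by the length of the application launch sequence. A minimal Python sketch of that arithmetic follows; the function name is hypothetical, and the miss count of 25 is a made-up value (only the launch sequence length of 1566 for Firefox comes from Table 3).

```python
def prefetch_miss_ratio(num_missed_requests: int, launch_sequence_len: int) -> float:
    """Missed block requests as a fraction of the launch sequence length."""
    if launch_sequence_len <= 0:
        raise ValueError("launch sequence must contain at least one request")
    return num_missed_requests / launch_sequence_len


# 1566 requests: Firefox in Table 3. 25 missed requests: hypothetical count.
ratio = prefetch_miss_ratio(25, 1566)
```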


Table 3: Collected launch sequences (N_rawseq = 2)

Application       # of block    # of fetched    # of used
                  requests      blocks          files
Access            1296          106,992         555
Acrobat reader    960           73,784          178
Designer-qt4      2400          138,608         410
Eclipse           4163          155,216         787
Excel             1610          169,112         583
F-Spot            1180          49,968          304
Firefox           1566          60,944          433
Gimp              1939          66,928          799
Gnome             4739          228,872         538
Houdini           4836          290,320         724
Kdevdesigner      1537          44,904          467
Kdevelop          1970          63,104          372
Konqueror         1780          62,216          296
Labview           2927          154,768         354
Matlab            6125          267,312         742
OpenOffice        1425          104,600         308
Powerpoint        1405          120,808         576
Skype             892           41,560          197
Thunderbird       1533          64,784          429
Visio             1769          168,832         662
Word              1715          181,496         613
Xilinx ISE        4718          328,768         351

By examining the missed block requests, we could cat- 
egorize them into three types: (1) files opened by OS 
daemons and user daemons at boot time; (2) journaling 
data or swap partition accesses; and (3) files dynamically 
created or renamed at every launch (e.g., tmpfile()). 
The first type occurs because we force the page cache to 
be flushed in the experiment. In reality, they are highly 
likely to reside in the page cache, and thus, this type of 
misses will not be a problem. The second type is irrel- 
evant to the application, and observed even during idle 
time. The third type occurs more or less often, depend- 
ing on the application. FAST does not prefetch this type 
of block requests as they change at every launch. 


Experiments for the test scenarios. We measured the 
launch time of the benchmark applications for each test 
scenario listed in Section 5.1. Figure 6 shows that the 
average launch time reduction of FAST is 28% over the 
cold start scenario. The performance of FAST varies 
considerably among applications, ranging from 16% to 
46% reduction of launch time. In particular, FAST shows
performance very close to t_bound for some applications,
such as Eclipse, Gnome, and Houdini. On the other hand,
the gap between t_bound and t_FAST is relatively larger for
such applications as Acrobat reader, Firefox, OpenOffice,
and Labview.


Launch time behavior. We conducted experiments to
see if the application prefetcher works as expected
when it is run simultaneously with the application. We
chose Firefox because it shows a large gap between
t_bound and t_FAST. We monitored the generated block
requests during the launch of Firefox with the application
prefetcher, and observed that the first 12 of the entire
1566 block requests were issued by Firefox, which took
about 15 ms. As the application prefetcher itself must
be launched as well, FAST cannot prefetch these block
requests until it finishes its own launch. However, we
observed that all the remaining block requests were issued
by FAST, meaning that they were successfully prefetched
before the CPU needed them.

Figure 7: Usage of CPU and SSD (sampling rate = 1 kHz). Panel: Firefox.


CPU and SSD usage patterns. We performed another 
experiment to observe the CPU and SSD usage patterns 
in each test scenario. We chose two applications, Eclipse 
and Firefox, representing the two groups of applications
for which t_FAST is close to and far from t_bound, respectively.
We modified the OS kernel to sample the number
of CPU cores having runnable processes and to count the 
number of cores in the I/O wait state. Figure 7 shows 
the CPU and SSD usage of the two applications, where 
the entire CPU is regarded as busy if at least one of its 
cores is active. Similarly, the SSD is assumed busy if 
there are one or more cores in the I/O wait state. In the 
cold start scenario, there is almost no overlap between
CPU computation and SSD access for both applications. 
In the warm start scenario, the CPU stays fully active 
until the launch is completed as there is no wait. One ex- 
ception we observed is the time period marked with Cir- 
cle (a), during which the CPU seems to be in the event- 
waiting state. FAST is shown to be successful in overlap- 
ping CPU computation with SSD access as we intended. 
However, CPU usage is observed to be low at the beginning
of launch for both applications, which can be explained
with the example in Figure 2. As Eclipse shows
a shorter such time period (Circle (b)) than Firefox
(Circle (c)), t_FAST can reach closer to t_bound. In the case of
Firefox, however, the ratio of t_cpu to t_ssd is close to 1:1,
allowing FAST to achieve more reduction of launch time
for Firefox than for Eclipse.
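The busy/idle classification used for Figure 7 can be restated as a small Python sketch: the CPU counts as busy in a sample if at least one core has a runnable process, and the SSD counts as busy if at least one core is in the I/O wait state. The sample encoding, the function name, and the sample values are hypothetical; our actual sampling was done inside the modified kernel.

```python
from typing import Dict, List, Tuple

# Hypothetical sample: (cores with runnable processes, cores in I/O wait).
Sample = Tuple[int, int]


def usage_summary(samples: List[Sample]) -> Dict[str, float]:
    """Fraction of samples in which the CPU, the SSD, or both are busy."""
    n = len(samples)
    cpu_busy = [runnable >= 1 for runnable, _ in samples]
    ssd_busy = [iowait >= 1 for _, iowait in samples]
    both = sum(1 for c, s in zip(cpu_busy, ssd_busy) if c and s)
    return {
        "cpu": sum(cpu_busy) / n,
        "ssd": sum(ssd_busy) / n,
        "overlap": both / n,
    }


# Five made-up samples: CPU only, SSD only, both, both, neither.
samples = [(1, 0), (0, 1), (1, 1), (2, 1), (0, 0)]
summary = usage_summary(samples)
```

A cold start would show an overlap fraction near zero under this measure, whereas a launch with FAST should show a substantially larger one.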


Performance of sorted prefetch. Figure 6 shows that
the sorted prefetch reduces the application launch time
by an average of 7%, which is less effective than FAST,
but non-negligible. One reason for this improvement is
the difference in I/O burstiness between the cold start
and the sorted prefetch. Most SSDs (including the one
we used) support the native command queueing (NCQ)
feature, which allows up to 31 block requests to be sent
to an SSD controller. Using this information, the SSD
controller can read as many NAND flash chips as possible,
effectively increasing read throughput. The average
queue depth in the cold start scenario is close to 1, meaning
that most of the time there is only one outstanding
request at the SSD. In contrast, in the sorted prefetch
scenario, the queue depth will likely grow larger than
1 because the prefetcher may successively issue asynchronous
I/O requests using posix_fadvise() at small
inter-issue intervals.

Figure 8: Simultaneous launch of multiple applications.
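The difference between the sorted and unsorted prefetchers lies only in the order in which the asynchronous requests are issued. The following Python sketch illustrates this under stated assumptions: the Prefetch record and the issue_prefetches function are hypothetical, and the mapping from LBAs to file descriptors and offsets is taken as given. By default the sketch issues posix_fadvise() with POSIX_FADV_WILLNEED, which starts a non-blocking read into the page cache; the demonstration injects a recording function instead, so that no real files are needed.

```python
import os
from typing import Callable, List, NamedTuple, Optional, Sequence


class Prefetch(NamedTuple):
    lba: int     # logical block address of the request on the device
    fd: int      # open file descriptor covering the block
    offset: int  # byte offset within the file
    length: int  # number of bytes to fetch


def issue_prefetches(
    sequence: Sequence[Prefetch],
    sort_by_lba: bool,
    advise: Optional[Callable[[int, int, int], None]] = None,
) -> List[int]:
    """Issue one asynchronous read hint per request; return the LBA order used."""
    if advise is None:
        def advise(fd: int, offset: int, length: int) -> None:
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    ordered = sorted(sequence, key=lambda r: r.lba) if sort_by_lba else list(sequence)
    for request in ordered:
        advise(request.fd, request.offset, request.length)
    return [request.lba for request in ordered]


# Made-up launch sequence; a recording stub replaces the real system call.
calls = []
seq = [Prefetch(700, 3, 0, 4096), Prefetch(100, 4, 8192, 4096), Prefetch(400, 3, 4096, 4096)]
launch_order = issue_prefetches(seq, sort_by_lba=False, advise=lambda *a: calls.append(a))
sorted_order = issue_prefetches(seq, sort_by_lba=True, advise=lambda *a: calls.append(a))
```

In both orders the requests are issued back to back without waiting for completion, which is what allows the queue depth to exceed 1.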

On the other hand, we could not find clear evidence
that sorting block requests in LBA order is advantageous
on SSDs. Rather, the execution time of
the sorted prefetcher was slightly longer than that of its
unsorted version for most of the applications we tested. Also, the
sorted prefetch shows worse performance than the cold
start for Excel, Powerpoint, Skype, and Word. Although
these observations were consistent over repeated tests,
further investigation is necessary to understand this
behavior.

Simultaneous launch of applications. We performed 
experiments to see how well FAST can scale up for 
launching multiple applications. We launched multiple 
applications starting from the top of Table 3, adding five 
at a time, and measured the launch completion time of 
all launched applications*. Figure 8 shows that FAST
could reduce the launch completion time for all the tests, 
whereas the sorted prefetch does not scale beyond 10 ap- 
plications. Note that the FAST improvement decreased 
from 20% to 7% as the number of applications increased 
from 5 to 20. 

Runtime and space overhead. We analyzed the runtime
overhead of FAST for seven possible combinations
of running processes, and summarized the results in Table 4.
Cases 2 and 3 belong to the launch profiling phase,
which was described in Section 4.4. During this phase,
Case 2 occurs only once, and Case 3 occurs N_rawseq
times. Case 4 corresponds to the prefetcher generation
phase (the right side of Figure 3), and shows a relatively
long runtime. However, we can hide it from users by running
it in the background. Also, since we primarily focused
on functionality in the current implementation, there is
room for further optimization. Cases 5, 6, and 7 belong
to the application prefetch phase, and repeatedly occur
until the application prefetcher is invalidated. Cases 6
and 7 occur only when n_chk reaches N_chk, and Case 7
can be run in the background.

* Except for Gnome, which cannot be launched with other applications,
and Houdini, whose license had expired.

Table 4: Runtime overhead (application: Firefox)

Case  Running processes                         Runtime (sec)
1     Application only (cold start scenario)
2     strace + blktrace + application
3     blktrace + application
4     Prefetcher generation
5     Prefetcher + application
6     Prefetcher + blktrace + application
7     Miss ratio calculation


FAST creates temporary files such as system call log 
files and I/O traces, but these can be deleted after FAST 
completes creating application prefetchers. However, the
generated prefetchers occupy disk space as long as application
prefetching is used. In addition, application
launch sequences are stored to check the miss ratio of 
the corresponding application prefetcher. In our exper- 
iment, the total size of the application prefetchers and 
application launch sequences for all 22 applications was 
7.2 MB. 


FAST applicability. While previous examples clearly 
demonstrated the benefits of FAST for a wide range of 
applications, FAST does not guarantee improvements for 
all cases. One such scenario is when a target application
is too small to offset the overhead of loading the
prefetcher. We tested FAST with the Linux utility uname,
which displays the name of the OS. It generated 3 I/O requests
whose total size was 32 KB. The measured t_cold
was 2.2 ms, and t_FAST was 2.3 ms, 5% longer than the
cold start time.


Another possible scenario is when the target applica- 
tion experiences a major update. In this scenario, FAST 
may fetch data that will not be used by the newly up- 
dated application until it detects the application update 
and enters a new launch profiling phase. We modified 
the application prefetcher so that it fetches the same size 
of data from the same file but from another offset that 
is not used by the application. We tested the modi- 
fied prefetcher with Firefox. Even in this case, FAST 
reduced application launch time by 4%, because FAST 
could still prefetch some of the metadata used by the ap- 
plication. Assuming most of the file names are changed 
after the update, we ran Firefox with the prefetcher for 
Gimp, which fetches a similar number of blocks as Fire- 
fox. In this experiment, the measured application launch 
time was 7% longer than the cold start time, but the per- 
formance degradation was not drastic due to the internal 
parallelism of the SSD we used (10 channels). 


FAST 11: 9th USENIX Conference on File and Storage Technologies 




Configuring application launch manager. The application
launch manager has a set of parameters to be configured,
as shown in Table 2. If N_rawseq is set too large,
users will experience the cold-start performance during
the initialization phase. If it is set too small, unnecessary
blocks may be included in the application prefetcher.
Figure 5 shows that setting it between 2 and 4 is a good
choice. The proper value of N_chk will depend on the runtime
overhead of Blktrace; if FAST is placed in the OS
kernel, the miss ratio of the application prefetcher may be
checked upon every launch (N_chk = 1) without noticeable
overhead. Also, setting R_miss to 0.1 is reasonable, but it
needs to be adjusted after gaining enough experience in
using FAST. To find the proper value of T_idle, we investigated
the SSD’s maximum idle time during the cold start
of applications, and found it to range from 24 ms (Thunderbird)
to 826 ms (Xilinx ISE). Hence, setting T_idle to 2
seconds is proper in practice. As the maximum cold-start
launch time is observed to be less than 10 seconds, 30
seconds may be reasonable for T_timeout. All these values
may need to be adjusted, depending on the underlying
OS and applications.
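The parameter values suggested above can be collected into a small configuration record, sketched here in Python. The class, field, and function names are hypothetical. The value of n_chk is a placeholder, since no concrete value is recommended for user-level operation, and the rule that a prefetcher is treated as stale once its miss ratio exceeds the threshold is an assumption made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchManagerConfig:
    n_rawseq: int = 2           # raw sequences captured per application (2 to 4 suggested)
    n_chk: int = 10             # launches between miss-ratio checks (placeholder value)
    r_miss: float = 0.1         # miss-ratio threshold suggested in the text
    t_idle_s: float = 2.0       # idle-time parameter, in seconds
    t_timeout_s: float = 30.0   # timeout parameter, in seconds


def prefetcher_is_stale(miss_ratio: float, cfg: LaunchManagerConfig) -> bool:
    """Assumed rule: a miss ratio above the threshold invalidates the prefetcher."""
    return miss_ratio > cfg.r_miss


cfg = LaunchManagerConfig()
kernel_cfg = LaunchManagerConfig(n_chk=1)  # check on every launch, as for an in-kernel FAST
stale_low = prefetcher_is_stale(0.016, cfg)   # average miss ratio observed in Section 5.2
stale_high = prefetcher_is_stale(0.25, cfg)   # made-up value after a hypothetical update
```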

Running FAST on HDDs. To see how FAST works on
an HDD, we replaced the SSD with a Seagate 3.5” 1 TB
HDD (ST31000528AS) and measured the launch time of
the same set of benchmark applications. Although FAST
worked as expected by hiding most of the CPU computation
from the application launch, the average launch
time reduction was only 16%. This is because the application
launch on an HDD is mostly I/O bound; in the cold
start scenario, we observed that about 85% of the application
launch time was spent on accessing the HDD. In
contrast, the sorted prefetch was shown to be more effective;
it could reduce the application launch time by an
average of 40% by optimizing disk head movements.

We performed another experiment by modifying the
sorted prefetch so that the prefetcher starts simultaneously
with the original application, like FAST. However,
the resulting launch time reduction was only 19%, which
is worse than that of the unmodified sorted prefetch. The
performance degradation is due to the I/O contention between
the prefetcher and the application.


6 Applicability of FAST to Smartphones 


The similarity between modern smartphones and PCs 
with SSDs in terms of the internal structure and the us- 
age pattern, as summarized below, makes smartphones a 
good candidate to which we can apply FAST: 


• Unlike other mobile embedded systems, smartphones
  run different applications at different times,
  making application launch performance matter more;

• Smartphones use NAND flash as their secondary
  storage, whose performance characteristics
  are basically the same as those of the SSD; and

• Smartphones often use slightly customized (if not
  the same) OSes and file systems that are designed
  for PCs, reducing the effort to port FAST to smartphones.

Figure 9: Measured application launch time on iPhone 4
(CPU: 1 GHz, SDRAM: 512 MB, NAND flash: 32 GB).


Furthermore, a smartphone has characteristics that
enhance the benefit of using FAST, as follows:

• Users tend to launch and quit applications more
  frequently on smartphones than on PCs;

• Due to the relatively smaller main memory of a
  smartphone, users will experience cold start performance
  more frequently; and

• Its relatively slower CPU and flash storage speed
  may increase the absolute reduction of application
  launch time achieved by applying FAST.


Although we have not yet implemented FAST on a 
smartphone, we could measure the launch time of some 
smartphone applications by simply using a stopwatch. 
We randomly chose 14 applications installed on the 
iPhone 4 to compare their cold and warm start times, of 
which the results are plotted in Figure 9. The average 
cold start time of the smartphone applications is 6.1 seconds,
which is more than twice the average cold start
time of the PC applications (2.4 seconds) shown in Figure 6.
Figure 9 also shows that the average warm start
time is 63% of the cold start time (almost the same ra- 
tio as in Figure 6), implying that we can achieve similar 
benefits from applying FAST to smartphones. 


7 Comparison of FAST with Traditional 
Prefetching 


FAST is a special type of prefetching optimized for appli- 
cation launch, whereas most of the traditional prefetch- 
ing schemes focus on runtime performance improve- 
ment. We compare FAST with the traditional prefetching 
algorithms by answering the following three questions 
that are inspired by previous work [32]. 


7.1 What to Prefetch


FAST prefetches the blocks that appear in the application
launch sequence. While many prediction-based prefetching
schemes [9, 23, 39] suffer from a low hit ratio of
the prefetched data, FAST can achieve a near-100% hit
ratio. This is because the application launch sequence
changes little over repeated launches of an application,
as observed by previous work [4, 18, 34].

Sequential pattern detection schemes like readahead 
[13, 31] can achieve a fairly good hit ratio when acti- 
vated, but they are applicable only when such a pattern 
is detected. By contrast, FAST guarantees stable perfor- 
mance improvement for every application launch. 

One way to enhance the prefetch hit ratio for a com- 
plicated disk I/O pattern is to analyze the application 
source code to extract its access pattern. Using the thus-obtained
pattern, prefetching can be done by either inserting
prefetch code into the application source code
[29, 38] or converting the source code into a computa- 
tion thread and a prefetch thread [40]. However, such 
an approach does not work well for application launch 
optimization because many of the block requests gener- 
ated during an application launch are not from the ap- 
plication itself but from other sources, such as loading 
shared libraries, which cannot be analyzed by examin- 
ing the application source code. Furthermore, both re- 
quire modification of the source code, which is usually 
not available for most commercial applications. Even 
if the source code is available, modifying and recompil- 
ing every application would be very tedious and incon- 
venient. In contrast, FAST does not require application 
source code and is thus applicable for any commercial 
application. 

Another relevant approach [6] is to deploy a shadow 
process that speculatively executes the copy of the orig- 
inal application to get hints for the future I/O requests. 
It does not require any source modification, but con- 
sumes non-negligible CPU and memory resources for the 
shadow process. Although it is acceptable when CPU 
is otherwise stalled waiting for the I/O completion, em- 
ploying such a shadow process in FAST may degrade ap- 
plication launch performance as there is not enough CPU 
idle period as shown in Figure 7. 


7.2 When to Prefetch 


FAST is not activated until an application is launched, 
which is as conservative as demand paging. Thus, un- 
like prediction-based application prefetching schemes 
[12, 28], there is no cache-pollution problem or addi- 
tional disk I/O activity during idle period. However, once 
activated, FAST aggressively performs prefetching: it 
keeps on fetching subsequent blocks in the application
launch sequence asynchronously even in the absence of
page misses. As the prefetched blocks are mostly (if not 
all) used by the application, the performance improve- 
ment of FAST is comparable to that of the prediction- 
based schemes when their prediction is accurate. 


7.3 What to Replace 


FAST does not modify the replacement algorithm of 
page cache in main memory, so the default page replace- 
ment algorithm is used to determine which page to evict 
in order to secure free space for the prefetched blocks. 

In general, prefetching may significantly affect the 
performance of page replacement. Thus, previous work 
[5, 14, 35] emphasized the need for integrated prefetch- 
ing and caching. However, FAST differs from the tradi- 
tional prefetching schemes since it prefetches only those 
blocks that will be referenced before the application 
launch completes (e.g., in the next few seconds). If the page
cache in the main memory is large enough to store all 
the blocks in the application launch sequence, which is 
commonly the case, FAST will have minimal effect on 
the optimality of the page replacement algorithm. 


8 Conclusion 


We proposed a new I/O prefetching technique called 
FAST for the reduction of application launch time on 
SSDs. We implemented and evaluated FAST on the 
Linux OS, demonstrating its deployability and performance
superiority. While the HDD-aware application
launcher showed only a 7% launch time reduction on
SSDs, FAST achieved a 28% reduction with no additional
overhead, demonstrating the need for, and the
utility of, a new SSD-aware optimizer. FAST with a 
well-designed entry-level SSD can provide end-users the 
fastest application launch performance. It also incurs 
fairly low implementation overhead and has excellent 
portability, facilitating its wide deployment in various 
platforms. 


Abstract


Multi-tier systems that combine SSDs with SAS/FC and/or 
SATA disks mitigate the capital cost burden of SSDs, while 
benefiting from their superior I/O performance per unit cost and 
low power. Though commercial SSD-based multi-tier solutions 
are available, configuring such a system with the optimal num- 
ber of devices per tier to achieve performance goals at mini- 
mum cost remains a challenge. Furthermore, these solutions 
do not leverage the opportunity to dynamically consolidate load 
and reduce power/operating cost. 

Our extent-based dynamic tiering solution, EDT, addresses 
these limitations via two key components of its design. A Con- 
figuration Adviser EDT-CA determines the adequate mix of 
storage devices to buy and install to satisfy a given workload 
at minimum cost, and a Dynamic Tier Manager EDT-DTM per- 
forms dynamic extent placement once the system is running to 
satisfy performance requirements while minimizing dynamic 
power consumption. Key to the cost minimization of EDT-CA 
is its ability to simulate the dynamic extent placement afforded 
by EDT-DTM. Key to the overall effectiveness of EDT-DTM is 
its ability to consolidate load within tiers when feasible, rapidly 
respond to unexpected changes in the workload, and carefully 
control the overhead due to extent migration. Our results us- 
ing production workloads show that EDT incurs lower capital 
and operating cost, consumes less power, and delivers similar 
or better performance relative to SAS-only storage systems as 
well as other simpler approaches to extent-based tiering. 


1 Introduction 


Enterprise storage systems strive to provide performance 
and reliability at minimum capital and operating cost. 
These systems use high performance disk drives (e.g. 
SCSI/SAS/FC) to provide that performance. However, 
solid-state drives (SSDs) offering superior random ac- 
cess capability per GByte have become increasingly af- 
fordable. On the other hand, SATA drives offering supe- 
rior cost per GByte are also attractive for mass storage. 
Systems with only SSDs are still too expensive, and those 
built using only SATA would not provide enough perfor- 
mance/GByte for most enterprise workloads. Multi-tier 
systems containing a mix of devices can provide high 
performing and lower cost storage by utilizing SSDs only 
for the subset of the data that needs SSD performance.

Current commercial SSD-based multi-tier systems
from IBM [29], EMC [17], 3PAR [25] and Compellent [23]
provide performance gains and cost savings.
However, customer adoption has been slow. One of the 
reasons for this is the difficulty in determining what mix 
of devices will perform well at minimum cost in the 
customer’s data center. This optimization task is highly 
complex because of the number of device types available 
along with the variability of workloads in the data center. 


To address this challenge, two things are needed: con- 
figuration tools to assist in building such systems and to 
demonstrate potential benefits based on customer work- 
load, and capabilities in the storage systems that can opti- 
mize placement of data in the tiers of storage. The place- 
ment should ensure that actively accessed data is co- 
located to minimize latency while lightly accessed data is 
placed most economically. There is also an opportunity 
to improve operating cost by placing data on the min- 
imum set of devices that can serve the workload while 
powering down the rest. Current products address some 
but not all of these challenges. Determining which mix 
of devices to buy remains a difficult problem, and im- 
provement of operating cost by consolidation and power 
management has not yet been tackled. 


To address these gaps, we develop an Extent-based 
Dynamic Tiering (EDT) system that includes: 1) a 
Configuration Adviser tool EDT-CA to calculate cost- 
optimized mixes of devices that will service a customer’s 
workload, and 2) a Dynamic Tier Management EDT- 
DTM component that runs in the configured storage sys- 
tem to place data by dynamically moving extents (fixed- 
size portions of a volume) to the most suitable tiers 
given current workload. EDT-CA works by simulating 
the dynamic placement of extents within tiers that offer 
the lowest cost to meet an extent’s I/O requirements as 
they change over time, and thus suitably size each tier. 
EDT-DTM monitors active workload and manages ex- 
tent placement and migration in such a way that per- 
formance goals are met while optimizing operating cost 
where feasible by consolidating data into fewer devices 
within each tier and powering off the rest. 


We evaluated EDT-CA and EDT-DTM, using both 
production and synthetic workloads on a storage system 
with SSDs, SAS, and SATA drives. Our results show 
that multi-tier systems using EDT have a device mix that
saves between 5% and 45% in cost, and consume up to 54%
less peak power and an additional 15-30% less dynamic
power (instantaneous power averaged over time),
at better or comparable performance compared to a homogeneous
SAS storage system. EDT’s design choices
are critical in achieving these savings. Dynamic extent 
placement saves 25% in cost compared to a static extent- 
based system. Including metrics in addition to IOPS in 
EDT’s placement provides a 2x performance improve- 
ment compared to a dynamic tiering system that allocates 
extents based on IOPS alone.

Our work makes the following contributions:

• EDT is the first publicly available work that formalizes
  and explores the design space for storage configuration
  and dynamic tier management in SSD-based
  multi-tier systems. (Section 2)

• EDT consists of a novel configuration algorithm for
  dynamic tiered systems that outputs lower cost
  configurations. (Sections 3, 4)

• EDT proposes a novel dynamic placement algorithm
  to satisfy performance requirements while
  minimizing dynamic power. (Section 5)

• EDT outperforms SAS-only and other simpler
  extent-based tiering approaches across a variety of
  workloads in both cost and power. (Section 6)


2 Multi-Tiering: Design Choices


This section describes important design choices for a 
multi-tier system that enable efficient use of the tiers. 


2.1 Extent-based Tiering 


First, we consider the granularity of data placement.
Previous studies [7, 11] suggest that I/O activity is highly 
variable across LBAs in a volume. Therefore, if data 
were placed at a volume level based on average volume 
workload characteristics, a large percentage of the tier 
will hold data that does not require the tier’s capabilities. 

Thus, we perform data placement at the granularity of 
an extent, a fixed-size portion of a volume. The smaller 
the extent size, the more efficient will be the data place- 
ment. However, operating at the extent level incurs meta- 
data overhead to keep track of extent locations and other 
statistics, and this overhead increases as extent size is decreased.
We choose an extent size with an acceptable
system overhead (details in § 6.2). Note that we expect 
the extent size to be larger than the typical file system 
block size and hence extents are not expected to align 
with file boundaries. However, the reduced system over- 
head for larger extents provides the right tradeoff com- 
pared to finer grain approaches. 
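The tradeoff between extent size and metadata overhead can be made concrete with a short Python sketch. The per-extent entry size of 64 bytes and the two extent sizes are hypothetical values chosen for illustration only; they are not the parameters of our system.

```python
def extent_metadata_bytes(
    volume_bytes: int, extent_bytes: int, bytes_per_extent_entry: int
) -> int:
    """Metadata needed to track every extent of a volume."""
    num_extents = -(-volume_bytes // extent_bytes)  # ceiling division
    return num_extents * bytes_per_extent_entry


MiB = 1 << 20
TiB = 1 << 40

# Hypothetical: 1 TiB volume, 64 bytes of location and statistics per extent.
small_extents = extent_metadata_bytes(TiB, 1 * MiB, 64)    # 1 MiB extents
large_extents = extent_metadata_bytes(TiB, 64 * MiB, 64)   # 64 MiB extents
```

Under these assumptions, shrinking the extent size by a factor of 64 increases the tracking metadata by the same factor.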


2.2 Dynamic Tiering 


The next design choice deals with the time scale at which 
extents move across tiers. One choice involves placing 
extents once during system instantiation or moving them
at coarse-grained intervals on the order of days or months.
However, since studies show that I/O rates of a workload 
are typically below peak most of the time [16, 19], this 
static or semi-static placement is not optimal—a place- 
ment that configures for the peaks pays extra in both 
cost and energy for a system that is over-provisioned at 
off-peak times; and a placement that mitigates cost from 
over-provisioning by configuring for the average I/O rate 
suffers from decreased performance during peaks. 

The alternate choice is to plan extent movement at in- 
tervals on the order of minutes or hours. We refer to this 
time interval as an epoch. Such a system exploits varia- 
tion in extent I/O rate to improve its efficiency; an extent 
is on a SATA tier when fairly inactive, and moves to the 
SAS or SSD tier as its I/O rate goes up. This achieves 
cost-effective use of resources and/or dynamic energy 
savings. Similarly, when the performance demanded of 
a single tier is below its peak capacity, extents placed on 
the tier can be consolidated into fewer devices for power 
savings. Often, the set of heavily loaded extents changes 
over time [11]. Dynamic migration of the heavily loaded 
extents into SAS or SSD when required enables cost- 
effective use of the resources. Thus, we choose to per- 
form dynamic data placement with an epoch length of 
the order of minutes/hours. 

The drawback of such a dynamic system, however, is 
the cost of data migration, i.e., the potential adverse ef- 
fect on foreground I/O latency and the migration latency 
itself before the desired outcome. Longer epoch dura- 
tions allow more time to execute migrations and amortize 
overhead better. Thus, we pick an epoch duration whose 
estimated migration overhead is below the allowable sys- 
tem migration overhead (details in § 6.2). Additionally, 
it is important to ascertain that the overhead of migrat- 
ing data does not overwhelm its benefit. This depends on 
the stability of the workload—extents that relocate often 
benefit less from migration compared to extents that stay 
longer in a particular tier. The workloads we have stud- 
ied indicate that dynamic migration is typically benefi- 
cial, but we believe that a dynamic system must also be 
able to back off when lack of workload stability causes 
dynamic migration to interfere with performance. 


2.3 Beyond I/O Rate Based Tiering 


This design choice determines the extent-level statistics 
required to match an extent with the right tier. The avail- 
able public documentation about commercial extent- 
based multi-tier products indicates use of IOPS to mea- 
sure load; in these systems high IOPS regions are placed 
onto SSD while leaving the remainder of the data on SAS 
or SATA. Although this method is intuitively correct, 
our preliminary analysis reveals significant drawbacks: 
IOPS-based placement does not factor in the bandwidth
requirement of an extent. For example, consider an extent
with a long sequential access pattern consisting of
small I/Os to contiguous locations. Such an extent will
have high IOPS and bandwidth requirements. Our analysis
of production and SPC-1-like [1] workload traces (§ 6),
collected after the I/O scheduler, shows such patterns.
Using I/O rate statistics alone causes such sequential
streams, which are more cost-effectively served
on SAS or even SATA, to be inappropriately placed on
SSD. IOPS-based placement also ignores the capacity of
the extent. An extent with high IOPS relative to other extents
may not have high enough I/O density (IOPS/GByte) to
justify the high $/GByte cost of the SSD.

Figure 1: EDT system architecture.

Our approach is to collect more than just I/O counts. 
We employ a heuristic as in [22] to break down an 
extent’s workload: I/Os that access LBAs within 512 
KBytes of the previous ones are taken as part of a se- 
quential stream and contribute to an extent’s bandwidth 
requirement. I/Os further apart are characterized as random
I/Os and are used to compute a random I/O rate.
Thus, for each extent, we collect a random I/O rate and 
bandwidth. Other methods for separating the I/Os into 
random and sequential may also be applicable. 
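
The classification heuristic described above can be sketched as follows. This is a minimal illustration rather than the EDT implementation: the function name `classify_ios`, the `(offset, size)` tuple layout, measuring distance from the end of the previous I/O, and counting the first I/O as random are all assumptions; the text specifies only the 512 KByte threshold.

```python
SEQ_WINDOW_BYTES = 512 * 1024  # threshold stated in the text


def classify_ios(ios, window=SEQ_WINDOW_BYTES):
    """Split one extent's I/Os into a random I/O count and sequential bytes.

    `ios` is an iterable of (offset_bytes, size_bytes) in arrival order.
    Assumption: distance is measured from the end of the previous I/O,
    and the first I/O (which has no predecessor) is counted as random.
    """
    random_ios = 0
    sequential_bytes = 0
    prev_end = None
    for offset, size in ios:
        if prev_end is not None and abs(offset - prev_end) <= window:
            sequential_bytes += size   # contributes to bandwidth
        else:
            random_ios += 1            # contributes to random I/O rate
        prev_end = offset + size
    return random_ios, sequential_bytes
```

Dividing the two results by the measurement interval gives the per-extent random I/O rate and bandwidth used in the rest of the design.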


3 EDT: Design Overview


EDT consists of two elements as depicted in Figure 1: 
a Configuration Adviser (EDT-CA) that determines the 
right number of devices per tier to install into a storage 
system, and a Dynamic Tier Manager (EDT-DTM) that 
operates inside a running system and continuously man- 
ages extent placement across tiers. EDT is expected to 
be deployed in a commercial storage system as shown 
in Figure 1, which exports many volumes, includes a virtualization
layer that allows volumes to be made up of
extents stored in arrays of different device types, is ca- 
pable of collecting and exporting statistics about extent 
workloads, and can execute requests to non-disruptively 
move extents between storage devices. 


An example usage scenario is as follows: A user
wishes to replace a SAS-based storage array with a new,
tiered storage system with twice the capability. He col- 
lects a trace of his workload over a 24 hour period that 
he thinks is representative. The trace is then run through 
EDT-CA which produces the minimum cost configura- 
tion of SSD, SAS, and SATA that can provide 2x the 
performance of the existing system. EDT-CA is aware 
of the runtime migration capabilities of EDT-DTM and 
takes them into account when determining the configura- 
tion. The user installs the new system. During operation 
of the new system, EDT-DTM manages migration be- 
tween tiers by continuously collecting extent level statis- 
tics, consolidates data onto lower-power tiers when pos- 
sible, and monitors the system to ensure that the work- 
load performance is not throttled. 


In general, EDT-CA starts by determining the work- 
load requirements for the system it is going to configure. 
This can either be done with a user generated general 
description of requirements including IOPS, seq/random 
mix, length of I/O requests, and their distribution across 
extents, or by using time series data collected from a 
workload running on an existing system. For the scope 
of this work, we assume availability of time series statistics.
In this approach, EDT-CA takes an epoch-granularity
trace of extent workload statistics sampled at times when 
storage system usage is high. It then estimates the re- 
sources required in different tiers to satisfy that workload 
by simulating placement of each extent in a tier that min- 
imizes its incurred cost while meeting its performance 
requirements. It repeats this process every epoch and as- 
signs extents to their lowest cost tier based on their per- 
formance requirements in that epoch. At the end of this 
simulation, EDT-CA determines the set of devices that 
are needed based on the maximum number of devices 
needed in each tier over all the epochs. This configura- 
tion determines the set of devices purchased by the user. 


Once the new tiered system is up and running, EDT- 
DTM manages extent placement. It collects extent level 
statistics, estimates extents’ resource consumption in different
tiers, and then plans and executes migrations.
EDT-DTM implements a throttling correction mecha- 
nism to ensure that performance requirements are sat- 
isfied as they vary over time; it constantly monitors ar- 
ray performance and if performance throttling is detected 
relocates extents to restore performance. EDT-DTM’s 
placement algorithm seeks to place each extent into the 
lowest-energy tier that satisfies its performance require- 
ment and then to further minimize energy by consolidat- 
ing extents in the same tier into fewer devices allowing 
unused devices to be powered down. Both these algo- 
rithms use a Migrator module to move extents. 

EDT-CA and EDT-DTM work together to minimize 
cost. EDT-CA minimizes acquisition cost, and EDT- 
DTM minimizes operating cost. As our results will show, 
configurations based on static extent placement are more 
expensive both to acquire and operate. 


3.1 Common Components 


EDT-CA and EDT-DTM share components that collect 
statistics and calculate resource consumption. 


3.1.1 Data Collector 


The Data Collector receives information about I/O com- 
pletion events including the transfer size, response time, 
logical block address (LBA), the volume ID to which
the I/O was issued, and the array which executed the 
I/O. The collector then maps the (LBA, volume id) pair 
of each I/O to a unique extent in the system, and com- 
piles for each extent, the number of random I/Os and the 
number of transferred bytes. It then periodically (every 
minute in our implementation) computes instantaneous 
bandwidth and random IOPS per extent as well as an 
exponentially-weighted moving average. In addition to 
the extent statistics, the collector aggregates statistics per 
array. It maps each I/O to its array and compiles its 
IOPS and average response time. These measurements 
are used by EDT-DTM to determine if I/Os on an array 
are being throttled. For a very large system the amount 
of data collected by the data collector may be significant. 
If this is an issue, the extent size can be made larger
to reduce the volume of statistical data. 
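
A minimal sketch of the per-extent bookkeeping described above follows. It is illustrative only: the class and method names, the 512-byte LBA unit, and the smoothing weight `ALPHA` are assumptions not given in the text, and per-array response-time aggregation is omitted.

```python
from collections import defaultdict

EXTENT_BYTES = 64 * 1024 * 1024  # extent size used in the experiments (Section 6.2)
BLOCK_BYTES = 512                # assumed size of one LBA unit
ALPHA = 0.5                      # assumed EWMA smoothing weight


class ExtentStats:
    def __init__(self):
        self.random_ios = 0      # random I/Os seen in the current interval
        self.bytes = 0           # bytes transferred in the current interval
        self.ewma_iops = 0.0
        self.ewma_bw = 0.0


class DataCollector:
    def __init__(self, interval_s=60):
        self.interval_s = interval_s
        self.extents = defaultdict(ExtentStats)

    def record(self, volume_id, lba, size_bytes, is_random):
        """Map a completed I/O to its extent and update that extent's counters."""
        key = (volume_id, (lba * BLOCK_BYTES) // EXTENT_BYTES)
        stats = self.extents[key]
        stats.bytes += size_bytes
        if is_random:
            stats.random_ios += 1

    def tick(self):
        """Run once per interval: return instantaneous (IOPS, bytes/s) per
        extent, fold them into the moving averages, and reset the counters."""
        rates = {}
        for key, s in self.extents.items():
            iops = s.random_ios / self.interval_s
            bw = s.bytes / self.interval_s
            s.ewma_iops = ALPHA * iops + (1 - ALPHA) * s.ewma_iops
            s.ewma_bw = ALPHA * bw + (1 - ALPHA) * s.ewma_bw
            rates[key] = (iops, bw)
            s.random_ios = 0
            s.bytes = 0
        return rates
```

Because one `ExtentStats` record exists per extent, the memory footprint of this structure shrinks proportionally as the extent size grows, which is the tradeoff noted above.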


3.1.2 Resource Consumption Model 


The Resource Consumption Model uses the extent statistics
to estimate the resources an extent consumes when placed on
a device of a given type. Resources are allocated based
on the observed capacity and performance requirements 
at the device level. Therefore, any workload optimiza- 
tions like deduplication, compression, and caching do 
not need to be considered in these models as their effects 
will be captured by the usage statistics. 

Figure 2: Lowest cost tier for extents with different
characteristics.

An extent consumes the resources of a device along
capacity and performance dimensions. Consider an extent
of size $E_c$ and a performance requirement $E_p$, determined
by its random IOPS rate (RIOR) and bandwidth
measured in previous epochs. The fraction of capacity
required to host an extent E in device D, $RC(E_c, D)$, is
straightforward:

$$RC(E_c, D) = \frac{\text{Capacity required by extent}}{\text{Total space in device}}$$

For performance utilization, we use a simplified model
based on Uysal et al.’s work [30]. The performance
resource consumption of extent E when placed on device
D, $RC(E_p, D)$, is:

$$RC(E_p, D) = RIOR \cdot Rtime + Bandwidth \cdot Xtime$$

Here RIOR is the number of random I/Os sent to an 
extent in a second (IO/s) and Rtime is the expected re- 
sponse time of the device (s/IO). Bandwidth is the band- 
width requested from the device (MB/s), and Xtime is the 
average transfer time (s/MB). The result of this equation 
is the fraction of the device performance utilized by an 
extent. Note that the Rtime and Xtime values are av- 
erages and may need to be adjusted depending on the 
expected workload. For example an SSD with a mostly 
random write workload would have significantly higher 
Rtime than the same SSD with a mostly random read 
workload. The overall resource required by an extent is 
then the maximum of the capacity utilization fraction and 
the performance utilization fraction: 

$$RC(E, D) = \max(RC(E_p, D), RC(E_c, D))$$

The resource consumption model determines the most 
efficient tier for an extent. For instance, when minimiz- 
ing cost, the most suitable tier is the one where the extent 
incurs the lowest cost (the product of the device cost and 
the extent’s resource consumption on that device). Fig- 
ure 2 confirms the advantage of multi-tier systems since 
the most cost-effective tier changes with extent charac- 
teristics, namely the total IOPS and the percentage of 
sequential accesses among three classes of storage de- 
vices specified in Section 6. As expected, we observe 
that mostly idle extents favor SATA, medium IOPS favor 
SAS, and high IOPS favor SSD. Further, as expected, 
more sequential extents favor HDDs. 
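
The resource consumption model and the resulting tier choice can be illustrated with a short sketch. The device parameters are the values printed in Table 1 and the drive capacities from the testbed description; the names `Device`, `rc`, and `lowest_cost_tier`, and the rule that a tier "meets performance" when a single device's performance utilization is at most 1, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    cost: float            # $ per device
    capacity_gb: float
    rtime_s: float         # s per random I/O
    xtime_s_per_mb: float  # s per MB (ms/KB is numerically equal when 1 MB = 1000 KB)


DEVICES = [
    Device("SSD", 430, 180, 0.0002, 0.01),
    Device("SAS", 325, 450, 0.00375, 0.004),
    Device("SATA", 170, 1000, 0.009, 0.009),
]


def rc_capacity(size_gb, dev):
    return size_gb / dev.capacity_gb


def rc_performance(rior, bw_mbps, dev):
    return rior * dev.rtime_s + bw_mbps * dev.xtime_s_per_mb


def rc(size_gb, rior, bw_mbps, dev):
    """Overall resource consumption: the larger of the two fractions."""
    return max(rc_capacity(size_gb, dev), rc_performance(rior, bw_mbps, dev))


def lowest_cost_tier(size_gb, rior, bw_mbps, devices=DEVICES):
    """Pick the device type with the lowest cost(D) * RC(E, D).

    Assumption: only device types whose performance utilization is <= 1 are
    considered; if none qualifies, all device types are compared.
    """
    feasible = [d for d in devices if rc_performance(rior, bw_mbps, d) <= 1.0]
    candidates = feasible or list(devices)
    return min(candidates, key=lambda d: d.cost * rc(size_gb, rior, bw_mbps, d))
```

With these parameters, an idle 64 MB extent lands on SATA, an extent with 100 random IOPS lands on SSD, and a purely sequential 50 MB/s extent lands on SAS, in line with the qualitative trends described for Figure 2.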

4 Configuration Adviser


EDT-CA builds on the Data Collector and the Resource 
Consumption Model described above. Since configura- 
tion is an NP-Hard packing problem, we propose a light- 
weight heuristic to achieve low cost extent placement: 

1. Binning. For each extent E and device type D, we
compute the cost of allocating the extent to that device
as extent_cost(E, D) = cost(D) · RC(E, D). The
extent is then placed in the tier that meets its performance
requirement with the lowest cost. Iterating over all the
extents, the above computation separates the extents
into bins, one per tier.

2. Sizing a bin. For each bin, we obtain its performance
and capacity resource consumption as
$RC_p = \sum_{E} RC(E_p, D)$ and $RC_c = \sum_{E} RC(E_c, D)$,
summed over all extents E in the bin.
The maximum of these two values gives the total
bin resources required, and the number of required
devices of this bin type is computed by rounding
this value up to the next integer.

3. This process is independently repeated for each 
epoch to identify the number of devices per tier that 
yields minimum cost for that epoch. 

4. The last step consists of combining these differ- 
ent configurations to obtain a final system config- 
uration valid across time. For the scope of this 
work, we achieve the final configuration by allocat- 
ing the maximum number of devices of each type 
used across all epochs. That is, if at epoch $t_0$, 2 devices
of type D and 1 of type D’ are the most cost
effective, but at epoch $t_1$, 1 of type D and 2 of D’ is
better, then our method will indicate that we need 2 
of type D and 2 of D’. 
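
The four steps above can be sketched as follows. This is a simplified illustration under stated assumptions: device parameters are the values printed in Table 1 with the testbed capacities, the function names are hypothetical, and the per-extent check that a tier "meets its performance" is reduced to picking the cheapest tier.

```python
import math

# tier -> (cost $, capacity GB, Rtime s/IO, Xtime s/MB), as printed in Table 1
TIERS = {
    "SSD": (430, 180, 0.0002, 0.01),
    "SAS": (325, 450, 0.00375, 0.004),
    "SATA": (170, 1000, 0.009, 0.009),
}


def rc_parts(extent, tier):
    """Return (RC_c, RC_p) for an extent given as (size_gb, rior, bw_mbps)."""
    size_gb, rior, bw = extent
    _, cap, rtime, xtime = TIERS[tier]
    return size_gb / cap, rior * rtime + bw * xtime


def configure_epoch(extents):
    """Steps 1-2: bin each extent into its cheapest tier, then size each bin."""
    sums = {t: [0.0, 0.0] for t in TIERS}  # per tier: [sum RC_c, sum RC_p]
    for e in extents:
        best = min(TIERS, key=lambda t: TIERS[t][0] * max(rc_parts(e, t)))
        rc_c, rc_p = rc_parts(e, best)
        sums[best][0] += rc_c
        sums[best][1] += rc_p
    # devices per tier = max of the two sums, rounded up
    return {t: math.ceil(max(s)) for t, s in sums.items()}


def configure(epochs):
    """Steps 3-4: size every epoch independently, keep the per-tier maximum."""
    final = {t: 0 for t in TIERS}
    for extents in epochs:
        for t, n in configure_epoch(extents).items():
            final[t] = max(final[t], n)
    return final
```

Taking the per-tier maximum mirrors the conservative combination rule in step 4: an epoch needing (2 SSD, 1 SAS) and another needing (1 SSD, 2 SAS) yield a final configuration of (2 SSD, 2 SAS).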

Our current method of combining configurations 
across epochs is fairly conservative and could potentially 
result in an over-provisioned system. However, as our 
current algorithm already results in lower cost configura- 
tions (Section 6), we relegate exploring more efficient 
ways of combining configurations over time to future 
work. Also note that when we compute tiered configu- 
ration for each epoch independently, we assume that the 
extents can be suitably migrated between epochs if re- 
quired. As part of our future work, we intend to model 
the required number of migrations, and suitably adjust 
the provisioning if the required migrations exceed the 
maximum number of migrations a system can support in 
a chosen interval of time. Finally, our Configuration Al- 
gorithm can also be used to upgrade a multi-tier system 
to meet upcoming performance demands. 


5 Dynamic Tier Manager 


EDT-DTM combines three new modules with the Data 
Collector and the Resource Consumption Model to 
continuously optimize extent placement: (1) a Tiering
and Consolidation module, (2) a Throttling Detector/Corrector
module, and (3) a Migrator module.


5.1 Tiering and Consolidation Algorithms 


At the end of every epoch, the Tiering and Consolidation 
(TAC) algorithms generate an extent placement to satisfy 
extent performance requirements and minimize dynamic 
system power. Such an energy efficient placement can 
be achieved both by leveraging the strengths (i.e. per- 
formance or capacity per watt) of the heterogeneous un- 
derlying hardware (SSD, SAS, and SATA drives), and by 
consolidating data into fewer devices when possible and 
turning off the unused devices. 

Similar to the configuration problem, placement for 
power minimization is also NP-Hard, and we propose a 
heuristic solution. TAC requires two inputs: (1) current 
random I/O rate and bandwidth for each extent from the 
actively running system, and (2) size (in bytes) and the 
random I/O rate and bandwidth capability for each array 
in the storage system. It then uses a two-step process to 
output a new extent placement that aims to adapt to the 
changes in the workload as follows: 

(1) Tiering. For each extent E, and device type D, 
we compute the “fractional power burden” of allocating
the extent to that device as extent_power(E, D) =
power(D) · RC(E, D). The extent is then placed on the tier
that meets its performance with the lowest power con- 
sumption. Doing so allows EDT to reduce active power 
via consolidation (described next). Iterating over all the 
extents results in one bin per tier. The assignment of ex- 
tents to a tier is performed locally on an extent by ex- 
tent basis, irrespective of the total performance needs or 
available space in that tier. 

(2) Consolidation. Extents assigned to each tier are then 
sorted using their RC values and placed in arrays using 
the First Fit Decreasing heuristic, a good approximation 
algorithm to the optimal solution for extent packing [35]. 
When extents already assigned to the tier under consid- 
eration exceed its available performance (i.e., resource 
consumption metric for the assigned extents exceeds 1) 
or the tier runs out of space in the available arrays, the re- 
maining extents in the extent list are demoted to the tier 
with the next lower power burden for that extent. This 
packing process is now repeated for all the tiers, con- 
solidating extents into a minimum number of arrays in a 
tier. Extents already in the right tier and on an array that 
will remain powered on in this epoch retain their posi- 
tion from the previous epoch, thereby saving migrations. 
Any unused arrays from the extent placement are set to a 
lower power state to conserve energy. 
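
The consolidation step for a single tier can be sketched with a First Fit Decreasing packer. The sketch makes several simplifying assumptions: each extent is described by a single RC value (the maximum of its capacity and performance fractions), each array has a normalized capacity of 1.0, and extents that do not fit are returned individually for demotion. The function name and data layout are hypothetical.

```python
def first_fit_decreasing(extents, num_arrays):
    """Pack (extent_id, rc) pairs into at most `num_arrays` arrays.

    Returns (placement, overflow): `placement` holds one list of extent ids
    per array actually used; `overflow` holds extents that did not fit and
    would be handed to the next tier.
    """
    loads = []       # current RC sum of each array in use
    placement = []   # extent ids assigned to each array in use
    overflow = []
    for ext_id, rc in sorted(extents, key=lambda e: e[1], reverse=True):
        for i, load in enumerate(loads):
            if load + rc <= 1.0:           # first array with enough room
                loads[i] += rc
                placement[i].append(ext_id)
                break
        else:
            if len(loads) < num_arrays and rc <= 1.0:
                loads.append(rc)           # power on another array
                placement.append([ext_id])
            else:
                overflow.append(ext_id)    # tier exhausted: demote
    return placement, overflow
```

Arrays in the tier that never appear in `placement` are the ones that can be moved to a lower power state.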


5.2 Throttling Detector and Corrector 


While the TAC mechanisms enable dynamic perfor- 
mance and power optimization, unexpected load and 
working set changes can suddenly alter the performance
requirements of extents. However, tracking this perfor- 
mance change, especially when an extent’s I/O rate in- 
creases, is challenging. Extents placed in a low perfor- 
mance tier cannot exhibit high I/O rates even when the 
application above may desire it. This causes throttling of 
the true IOPS requirement of the extent, artificially limit- 
ing it to a low value. The Throttling Detector overcomes 
this limitation by monitoring the average response time 
of each active array every minute. 

If the average response time of I/Os from an array in- 
dicates that undesirably high request queuing is occur- 
ring in the array, EDT decides that the array is throttling 
the true IOPS requirement of applications and causing 
delays. When throttling is detected, pending migrations 
driven by TAC are immediately halted and EDT-DTM 
switches to a throttling correction mode to perform re- 
covery. To respond rapidly and minimize the possibil- 
ity of future throttling in the same array, the load on the 
throttled array is shed by migrating a minimum set of 
extents responsible for at least half of its current total 
performance resource consumption. 

To select the target array(s), we first start by consider- 
ing the best possible tier for each extent being migrated, 
and within that tier we first examine arrays which are 
already active to see if they can absorb the new extent. 
If none can host the new extent, we consider arrays that 
are not in use in that tier if any are available. If the best 
tier cannot accommodate the extent, we try the same approach
on tiers with the next higher power burden for
that extent. If the array continues to remain throttled 
after half the load on the array has been migrated, the 
extent migration process is repeated, until the system is 
no longer throttled. The entire system stays in recov- 
ery mode while an array remains throttled, suspending 
energy optimizing migrations. When no arrays are throt- 
tled, the system switches back to the TAC placement af- 
ter an epoch elapses. 
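
One way to select the minimum set of extents responsible for at least half of a throttled array's load is a greedy largest-first pass, sketched below. The function name and the dictionary layout are assumptions; how throttling is detected (the response-time threshold) is not specified in the text and is left out.

```python
def extents_to_shed(extent_rc, fraction=0.5):
    """Return the smallest set of extents whose performance RC sums to at
    least `fraction` of the array's total.

    `extent_rc` maps extent id -> performance resource consumption on the
    throttled array. Taking extents in decreasing RC order minimizes the
    number of extents (and hence migrations) needed to reach the target.
    """
    target = fraction * sum(extent_rc.values())
    shed, moved = [], 0.0
    for ext_id, rc in sorted(extent_rc.items(), key=lambda kv: kv[1], reverse=True):
        if moved >= target:
            break
        shed.append(ext_id)
        moved += rc
    return shed
```

If the array is still throttled after these extents have moved, the same selection is applied again to the extents that remain.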


5.3 Migrator


The Migrator handles the data movement requests from 
TAC and the throttling algorithms. It compares the new 
placement of the extents from the above algorithms to 
their old placement, and identifies extents that need to 
be migrated. It then schedules and optimizes these mi- 
grations. On one hand, migrations that relieve throt- 
tling must be completed quickly. On the other hand, mi- 
grations cause additional I/O traffic, and care must be 
taken so that they do not affect the foreground I/O per- 
formance. 

Figure 3: Storage subsystem platform for evaluating
EDT-CA and EDT-DTM.

Our migration scheme achieves this tradeoff as follows.
We allow every device to be involved in only one
migration operation at a time. Thus, before issuing a
migration request, the Migrator performs admission control
by allowing requests only if the source and target
devices are both available. If they are not, the request is
re-queued and the Migrator moves on to the next request. Further,
the Migrator controls its migration-related resource consumption
by decomposing an extent into smaller transfer
units and pacing the transfer requests to match the minimum
of the available or the desired I/O rate. Further,
if the migration is being performed to relieve throttling,
once a transfer unit is migrated, any foreground I/O requests
to it are handled by the destination array. Note that
because of this pacing not all planned migrations may 
be completed before the next epoch. In such cases, the 
migration queue is flushed, and requests resulting from 
the new epoch’s computation are queued. We further 
optimize by retaining the old location of the extent if it 
is already in the right tier during the consolidation step. 
Finally, we could potentially incorporate other optimiza- 
tions [4, 9, 31, 36] such as multiple locations for the same 
extent [31], and proactive migrations [36]. 
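
The admission-control rule (one migration per device at a time, with blocked requests re-queued) can be sketched as a single pass over the migration queue. The function name, the request tuple layout, and the device identifiers are hypothetical; pacing of transfer units is not modeled.

```python
from collections import deque


def schedule_round(queue, busy):
    """Make one pass over the migration queue.

    `queue` is a deque of (extent_id, source_device, target_device) requests
    and `busy` is the set of devices already involved in a migration.
    A request is admitted only if both of its devices are free; otherwise it
    is re-queued and the next request is examined. Returns the admitted
    requests and marks their devices as busy.
    """
    admitted = []
    for _ in range(len(queue)):      # examine each pending request once
        req = queue.popleft()
        _, src, dst = req
        if src in busy or dst in busy:
            queue.append(req)        # re-queue, move on to the next request
        else:
            busy.update((src, dst))
            admitted.append(req)
    return admitted
```

When a migration completes, its source and target would be removed from `busy` before the next pass; at an epoch boundary the queue would simply be cleared and refilled.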


6 Evaluation 


Our evaluation uses both a SPC-1-like [1] benchmark 
workload and multiple production enterprise workloads 
from MSR [21] to demonstrate that: 


• In comparative evaluation, EDT-CA works to minimize
cost, and EDT-DTM satisfies performance requirements
while lowering power consumption.

• EDT’s dynamic behavior and detailed resource consumption
model help achieve its goal.

• Extent based dynamic optimization and consolidation
are feasible in practice with little overhead.


6.1 Methodology 


Device | Cost ($) | Power (W) (Idle, Active) | Random IOPS | BW (MB/s) | Rtime (ms/IO) | Xtime (ms/KB)
SSD    | 430      | 0.5, 1                   | 5000        | 90        | 0.2           | 0.01
SAS    | 325      | 12.4, 17.3               | 290         | 200       | 3.75          | 0.004
SATA   | 170      | 8.0, 11.6                | 135         | 105       | 9             | 0.009

Table 1: Characteristics of devices used in the testbed.

Comparison candidates. We compare EDT to three alternate
solutions:
1. SAS is chosen to represent current enterprise storage
system deployments that predominantly use only high
performance SAS drives. The configuration is derived
using the capacity and peak performance (IOPS and
bandwidth) requirements of the workload. Volumes
are statically assigned to SAS arrays in a load-balanced
manner.


2. EST (Extent-based Static Tiering) places extents on 
tiers statically to quantify the benefit from tiering. Con- 
figuration is performed as follows: at every epoch, the 
cost to place each extent on each tier is computed as 
done by EDT-CA using capacity, IOPS, and bandwidth 
requirements. An extent is then permanently placed 
on the tier that minimizes the sum of its instantaneous 
costs over all epochs. Once extents are binned into 
tiers, the number of devices for each tier is determined 
using that tier’s peak resource consumption. 


3. While SAS and EST illustrate the benefit from 
EDT’s design choices incrementally (going from a ho- 
mogeneous system to static tiering and then to dynamic 
tiering), we propose a third candidate to illustrate a dif- 
ferent design decision in dynamic multi-tier systems— 
IDT (IOPS Dynamic Tiering) implements extent-based 
dynamic configuration and placement using a greedy 
IOPS-only criterion where higher-IOPS extents move to
higher IOPS tiers. This is in contrast to EDT that uses 
a combination of capacity, IOPS, and bandwidth in its 
placement algorithm. 
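
The EST baseline's placement rule, which fixes each extent on the tier that minimizes the sum of its per-epoch costs, can be sketched as below. Device parameters are the values printed in Table 1 with the testbed capacities; the function name and the input layout are assumptions for illustration.

```python
# tier -> (cost $, capacity GB, Rtime s/IO, Xtime s/MB), as printed in Table 1
TIERS = {
    "SSD": (430, 180, 0.0002, 0.01),
    "SAS": (325, 450, 0.00375, 0.004),
    "SATA": (170, 1000, 0.009, 0.009),
}


def est_tier(epoch_stats, size_gb):
    """Choose one permanent tier for an extent.

    `epoch_stats` is a list of (rior, bw_mbps) pairs, one per epoch. The
    extent is placed on the tier with the lowest total of its instantaneous
    costs cost(D) * RC(E, D) over all epochs.
    """
    def total_cost(tier):
        cost, cap, rtime, xtime = TIERS[tier]
        return sum(cost * max(size_gb / cap, r * rtime + b * xtime)
                   for r, b in epoch_stats)

    return min(TIERS, key=total_cost)
```

In this sketch, a 64 MB extent that is idle for nine epochs and sees 100 random IOPS in one epoch is pinned to SSD for the whole trace, whereas a dynamic scheme could keep it on SATA for the nine idle epochs.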


Implementation. Our test system is shown in Fig- 
ure 3. In addition to EDT, we implemented an I/O dis- 
patcher that receives block I/O requests from applica- 
tions, maps the logical block address to the physical de- 
vice address, performs the corresponding I/Os, and com- 
municates with the EDT components. Our trace player 
application issues block I/Os from a trace via a socket 
to the I/O dispatcher. To support real-world applica- 
tions without modification, we implemented a pseudo 
block device interface. For the scope of this work, we 
use Linux’s default deadline scheduler, and our measure- 
ment of context switch overhead when running through 
the pseudo device driver was negligible (< 10 µs).

Experimental Testbed. Our experimental platform 
consists of an IBM x3650 with 4 Intel Xeon cores and 4 
GB memory acting as the I/O dispatcher. It is connected 
via internal and external SAS ports to 12 1 TB 7200 rpm 
3.5” SATA drives, 12 450 GB 15K rpm 3.5” SAS drives, 
and 4 180 GB Intel X25-M SSD drives. Table 1 shows 
the characteristics of these devices. The enclosures containing
the drives are connected to a Watts up? Pro power
meter. We report the disk power obtained by subtracting 
the baseline power used by the non-disk components of 
the enclosure (154 W). 

Metrics. To compare solutions, we evaluate static con- 
figuration results using capital cost and peak power con- 
sumption, and we evaluate dynamic behavior using the 
average and distribution of I/O latency along with dy- 
namic power consumption. Peak power consumption is 
obtained using disk drive data sheets. Dynamic power 
consumption is measured using the power meter. 


6.2 Parameter Selection 


Extent size. Smaller extents use tier and migration- 
related resources more efficiently and enable faster re- 
sponse to workload changes, but also incur greater meta- 
data overhead. Our approach was to pick the smallest 
extent size that incurs acceptable metadata overhead. Assuming
metadata can have a reasonably small overhead
of at most 0.001% of the total storage capacity, and
given 200 bytes/extent for metadata overhead (mostly 
from recording extent-level statistics) in our implemen- 
tation, the smallest extent size our storage system can 
support is 20 MB. To introduce some slack we used 64 
MB extents for our experiments. 
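
The extent-size bound follows from requiring (metadata bytes per extent) / (extent size) to stay within the overhead budget, so the minimum extent size is the metadata size divided by the allowed fraction. The sketch below expresses the budget in parts per million to keep the arithmetic in exact integers; the function names are hypothetical. With 200 bytes of metadata per extent, a budget of 10 ppm (0.001%) gives a 20 MB minimum.

```python
def min_extent_bytes(metadata_bytes_per_extent, max_overhead_ppm):
    """Smallest extent size (bytes) whose metadata stays within the budget.

    metadata / extent_size <= ppm / 1e6  =>  extent_size >= metadata * 1e6 / ppm
    The result is rounded up with integer ceiling division.
    """
    return -(-metadata_bytes_per_extent * 1_000_000 // max_overhead_ppm)


def overhead_ppm(metadata_bytes_per_extent, extent_bytes):
    """Metadata overhead of a given extent size, in parts per million."""
    return metadata_bytes_per_extent * 1_000_000 / extent_bytes
```

For the 64 MB extents used in the experiments, `overhead_ppm(200, 64 * 2**20)` is roughly 3 ppm, which leaves the slack mentioned above.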

Epoch duration. Shorter epochs allow quicker response 
to workload changes, but can also result in increased ex- 
tent migration. As the epoch duration increases, the sta- 
bility of extent characteristics increases due to averag- 
ing over longer periods and consequently the migration 
bandwidth overhead decreases. We picked epoch dura- 
tions that resulted in migration bandwidth limited to a 
10% fraction of the available array-pair bandwidth in the 
system¹. This prevents migration from significantly degrading
performance and ensures that migrations complete
early within each epoch. For the MSR workloads
this calculation resulted in a 30 minute epoch. 


6.3 Synthetic Workload 


This SPC-1-like workload was chosen because it simulates
an industry standard benchmark and provides a contrast
to the MSR trace workloads. We ran the SPC-1-like
workload generator on a 1 TB volume at 100 BSUs for 30
min using an over-provisioned configuration (a 12-disk SAS
RAID-0 array). We chose 30 min because the workload
is quite static after a short startup period. The resulting
trace was used to obtain the number of devices required
per tier for different methods (Table 2).

We observe that all the extent-based tiering configurations
outperform SAS configurations in both capital cost
and peak power consumption. EDT reduces cost by 14%
and peak power by 55% compared to SAS. The costs incurred
to configure EST and EDT for this relatively static workload
are similar. Although the IDT configuration seems
to provide the least cost configuration, this is an artifact
of rounding up required devices to the next higher integer.
Using fractional devices, costs for EDT and IDT are
much closer ($890 vs. $920). Note that in larger systems
rounding effects will be less significant.

System | # of Disks | Energy  | Cost  | Avg RT
SAS    | (0, 6, 0)  | 103.8 W | $1950 | 28 ms
EST    | (2, 2, 1)  | 46.6 W  | $1680 | 15 ms
IDT    | (2, 1, 1)  | 29.3 W  | $1355 | 21 ms
EDT    | (2, 2, 1)  | 46.6 W  | $1680 | 15 ms

Table 2: Configuration for synthetic workload. The
number of disks per tier is specified as (SSD, SAS,
SATA). The average response time is obtained from
running the configuration with 100 BSUs.

¹ Medium to large scale tiered storage systems would typically
perform simultaneous extent migrations across multiple array-pairs.

To confirm that EDT’s lower cost is not at the expense 
of performance, we ran the SPC-1-like workload for 30
minutes at 100 BSUs. Given the stability of the work- 
load, migration overhead was minimal. We therefore 
chose an epoch of 5 minutes to complete the experiments 
quickly. The SAS scheme used a 6-disk SAS RAID-0 array.
Other schemes operated on individual disks. We started 
EDT and IDT with the entire volume in the SATA tier 
and allowed dynamic extent migration to reach optimal 
configurations over time. EST, which does not support 
extent migration, was started with extents in their most 
suitable locations as per the EST configuration. 

The last column of Table 2 shows the average response 
times for 100 BSUs measured starting at the end of the 
first epoch, once the extent placements of the dynamic 
tiering configurations become effective. Given the work- 
load’s stability, results for EDT and EST are identical. 
They both achieve a 40% lower response time compared 
to SAS, and improve on IDT’s IOPS only placement by 
20%. Note that the dynamic power consumption in these 
experiments is similar to the peak power due to the lack 
of workload variation. 


6.4 Production Workload 


Our next workload (MSR-combined) represents the more 
interesting class of real-world workloads, obtained by 
combining the I/Os to the 31 (out of 36) most active vol- 
umes of a production storage system [21] for a total of 
4580 GB. Including the remaining 5 volumes was not 
feasible given the hardware restrictions of our testbed. 

Configuration outcomes. Configuration outcomes 
based on six days of the MSR-combined workload, 
shown as the “Equal Performance” group in Table 3, 
indicate that the tiering configurations have lower cost 
compared to SAS. EDT incurs the lowest cost (50% re- 
duction compared to SAS and 25% relative to EST). 

Config            | System | # of Disks | Energy  | Cost
Equal Performance | SAS    | (0, 16, 0) | 276.8 W | $5200
Equal Performance | EST    | (5, 2, 4)  | 82 W    | $3480
Equal Performance | IDT    | (4, 1, 4)  | 64.5 W  | $2725
Equal Performance | EDT    | (3, 2, 4)  | 81.6 W  | $2620
Equal Cost        | SAS    | (0, 12, 0) | 204 W   | $3900
Equal Cost        | EST    | (4, 4, 4)  | 116 W   | $3700
Equal Cost        | IDT    | (4, 4, 4)  | 116 W   | $3700
Equal Cost        | EDT    | (4, 4, 4)  | 116 W   | $3700

Table 3: Configuration for MSR-combined. Configurations
achieving equal performance depict improvement
in cost and peak power. Configurations at equal
cost are created for experimental ease. Number of
disks in each tier specified as (SSD, SAS, SATA).


EDT’s ability to effectively time share high-cost, high-performance
tiers across extents and satisfy sequentially
accessed ones with the SAS tier (instead of the SSD 
tier) results in more cost-effective configurations. Ex- 
tents placed in the SATA tier (4336 GB) are mostly idle 
with random IOPS below 0.32, those in SAS (69 GB) are 
dominated by bandwidth higher than 1.45 MB/s and ran- 
dom IOPS less than 1.43, and the SSD extents (175 GB) 
have random IOPS between 1.45 and 858. Tiered config- 
urations substantially reduce peak power when compared 
with SAS; IDT’s greater use of the SSD tier (relative to
SAS) makes it the most power-efficient. 

Performance and Power outcomes. Not all of the equal
performance configurations listed in Table 3 were fea- 
sible on our experimental testbed due to hardware lim- 
itations. Consequently, we decided to switch to equal 
cost configurations (shown in Table 3) to contrast per- 
formance at equal cost instead of cost at equal perfor- 
mance only for the MSR-combined workload. Later, we 
shall explore equal performance configurations for feasi- 
ble subsets of volumes (Figure 6). EDT’s configuration 
was chosen as the base for all the tiering systems, and 
its configuration requirements were rounded up to inte- 
ger number of arrays, each array consisting of 4 devices. 
SAS used only SAS drives for the same cost, split into 
4 disk RAID 0 arrays. We then replayed day one from 
the seven day trace, the most active 24 hour period of 
the MSR-combined workload. Both EDT and IDT were 
bootstrapped using a load balanced volume placement. 
Figure 4 summarizes the results of this experiment for 
the candidate solutions. First, we notice that the I/O re- 
sponse time distribution of EDT is clearly superior to the 
other three solutions, highlighting the importance of con- 
sidering random IOPS, bandwidth, and capacity when 
making tiering choices. The average response time with 
EDT was 2.94 ms while those for the SAS, EST, and 
IDT were 5.12, 9.33, and 5.93 ms respectively. Further, 
the 95th percentile response time for EDT was under 7.86
ms while the same for SAS, EST, and IDT were 19.31,
37.06, and 17.891 ms respectively. On average, EDT de-
creased the dynamic power consumption by 13% rela-
tive to its peak power, 55% relative to SAS and at least
10% relative to IDT and EST. This dynamic power sav-
ings result is likely to underestimate power savings ob-
served in real deployments given that the workload was
generated by consolidating multiple uncorrelated work-
load traces, which tended to reduce the workload vari-
ability that would enable dynamic power savings. Ad-
ditionally, the experiment was done over the most active
period, which required most devices to be active for per-
formance. Further, all the configurations here are sized
to meet the observed workload. Typically, however, stor-
age purchases are made to accommodate future growth
and hence over-provisioned to begin with, resulting in
more dynamic power savings.

Figure 4: I/O rate and power consumption (left) and response time distribution (right) for MSR-combined.
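
To put the reported response times in relative terms, the following sketch computes the fractional reduction EDT achieves over each alternative. It uses only the figures quoted above; the helper names are hypothetical. Because EDT's 95th percentile is reported as "under 7.86 ms", the percentile results are lower bounds.

```python
# Response times (ms) reported for the equal-cost MSR-combined replay.
MEAN_MS = {"EDT": 2.94, "SAS": 5.12, "EST": 9.33, "IDT": 5.93}
P95_MS = {"EDT": 7.86, "SAS": 19.31, "EST": 37.06, "IDT": 17.891}


def improvement(baseline_ms, edt_ms):
    """Fractional reduction in response time when moving from a baseline to EDT."""
    return 1.0 - edt_ms / baseline_ms


def summarize(table):
    """Map each non-EDT system to EDT's fractional improvement over it."""
    edt = table["EDT"]
    return {name: improvement(ms, edt) for name, ms in table.items() if name != "EDT"}


if __name__ == "__main__":
    for label, table in (("mean", MEAN_MS), ("95th percentile", P95_MS)):
        for system, gain in summarize(table).items():
            print("%s, EDT vs %s: %.1f%% lower" % (label, system, 100 * gain))
```

For the means this gives reductions of roughly 43% against SAS, 68% against EST, and 50% against IDT.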


Analysis. We illustrate how EDT achieves its supe- 
rior performance using two example extents chosen from 
the experiment and contrasting them with IDT. Figure 5 
shows the sequential and random IOPS over time for two 
extents along with the tier they are placed in. For extent 
A (top graph), both IDT and EDT move the extent from 
the SATA tier (the default initial location) to higher per- 
forming tiers when the total IOPS requirements increase. 
However, IDT allocates the SSD tier starting from hour 
3 on account of the exponentially weighted moving av- 
erage (EWMA) of total IOPS whereas EDT allocates the 
SSD tier only when the EWMA of random IOPS of the 
extent is high. Thus, EDT can better capitalize on the 
superior sequential performance of the SAS tier to min- 
imize capital costs during configuration and sustain per- 
formance during operation. Extent B (bottom graph) il- 
lustrates similar behavior during predominantly sequen- 
tial accesses. Further, both EDT and IDT rightly move 
extent B into the SATA tier when it becomes idle, aid- 
ing in power savings. Thus, EDT is successfully able to 
pick the best tier for an extent’s workload and relocate it
when the requirements change. Regarding the overheads
of these migrations, both EDT and IDT migrated around
120 extents per epoch, using an average bandwidth of 42
MB/s, which represents only 3% of the total available.

Figure 5: Contrasting extent migrations for EDT and
IDT. The two upper lines denote extent placement for the
different algorithms. Black is SSD tier, dark grey SAS and
light grey SATA.
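
The difference between the two policies for extent A can be sketched as follows. This is a deliberately simplified illustration, not EDT's actual placement algorithm: the thresholds, the smoothing factor, and the function names are all assumptions, and capacity constraints and migration costs are ignored. It shows only that keying the SSD decision on the EWMA of random IOPS, rather than total IOPS, keeps a mostly sequential extent on the SAS tier.

```python
# Hypothetical thresholds (IOPS); the paper's real decision logic is more involved.
SSD_IOPS = 100.0
SAS_IOPS = 10.0


def ewma(samples, alpha=0.5):
    """Exponentially weighted moving average of a non-empty sequence.

    alpha is the weight given to the newest sample.
    """
    avg = None
    for x in samples:
        avg = x if avg is None else alpha * x + (1 - alpha) * avg
    return avg


def idt_tier(random_iops, seq_iops, alpha=0.5):
    """IDT-style choice: driven by the EWMA of total IOPS only."""
    total = ewma([r + s for r, s in zip(random_iops, seq_iops)], alpha)
    if total >= SSD_IOPS:
        return "SSD"
    if total >= SAS_IOPS:
        return "SAS"
    return "SATA"


def edt_tier(random_iops, seq_iops, alpha=0.5):
    """EDT-style choice: SSD only when the EWMA of random IOPS is high."""
    rand = ewma(random_iops, alpha)
    seq = ewma(seq_iops, alpha)
    if rand >= SSD_IOPS:
        return "SSD"
    if rand + seq >= SAS_IOPS:
        return "SAS"  # active but mostly sequential: SAS serves it well
    return "SATA"


if __name__ == "__main__":
    # A mostly sequential extent, sampled once per epoch.
    random_iops = [5, 5, 5]
    seq_iops = [200, 300, 400]
    print("IDT:", idt_tier(random_iops, seq_iops))
    print("EDT:", edt_tier(random_iops, seq_iops))
```

With these made-up samples the IDT-style rule selects the SSD tier while the EDT-style rule selects SAS, and both fall back to SATA once the extent goes idle.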

















Workload | Volumes                                | Cap (GB) | Accessed
---------+----------------------------------------+----------+---------
server   | hm, mds, prn, prxy, stg, ts, wdev, web | 1650     | 30%
data     | proj, rsch, usr                        | 3719     | 34%
srccntl  | src1, src2                             | 904      | 29%

Table 4: Sub-workloads derived from MSR. 


Varying the workload. To analyze the sensitivity of
the various algorithms to workload characteristics, we
grouped volumes from the MSR workload as specified
in Table 4 to create the server, data and srccntl (source
code control) workloads. Configuration outcomes for
each sub-workload using SAS, IDT, and EDT are pre-
sented in Table 5. As with MSR-combined, the dynamic
tiering solutions are able to configure both lower-cost
and lower-energy systems when compared with SAS and
EST. Further, in the case of the server workload, EDT
optimizes the configured system cost with a single SSD
relative to the two SSDs recommended using IDT. Given
that EST had significantly inferior performance for MSR-
combined, we did not consider it for further analysis.

Workload | System | # of Disks | Energy  | Cost
---------+--------+------------+---------+------
server   | SAS    | (0, 6, 0)  | 103.8 W | $1950
         | EST    | (2, 1, 2)  | 42.5 W  | $1525
         | IDT    | (2, 1, 1)  | 30.9 W  | $1355
         | EDT    | (1, 2, 1)  | 47.2 W  | $1250
data     | SAS    | (0, 10, 0) | 173 W   | $3250
         | EST    | (2, 2, 3)  | 71.4 W  | $2020
         | IDT    | (1, 2, 4)  | 82 W    | $1760
         | EDT    | (1, 2, 4)  | 82 W    | $1760
srccntl  | SAS    | (0, 6, 0)  | 103.8 W | $1950
         | EST    | (2, 3, 1)  | 65.5 W  | $2005
         | IDT    | (2, 2, 2)  | 59.8 W  | $1850
         | EDT    | (2, 2, 2)  | 59.8 W  | $1850

Table 5: Configuration for MSR sub-workloads.
Number of disks in each tier specified as (SSD, SAS,
SATA).

Figure 6 shows EDT’s dynamic power consumption
and extent distribution across tiers over time, as well as
its response time distribution relative to IDT and SAS.
First, unlike MSR-combined, these workloads do have
substantial periods of lower utilization. Consequently,
in addition to improving the capital cost and peak power
consumption, EDT’s dynamic consolidation allows dy-
namic power savings of as much as 15-31% relative to
its peak power across the three workloads. The extent
distribution is quite different across the workloads. EDT
uses the SSD tier substantially for the srccntl workload.
IOPS-wise one would think that the workload should
be completely consolidated to the SATA tier; however, EDT
leverages the fact that the SSD tier offers improved en-
ergy efficiency for up to 40% of the extents. The SAS
tier was most used for server, in particular between hours
2-4 when sequential activity dominates. The data work-
load predominantly utilizes the SATA tier (as evidenced
in the configuration outcome) since the IOPS per extent
for most extents is very low, easily accommodated using
SATA devices. Finally, in this equal performance config-
uration experiment, the response time performance with
EDT is either similar to or better than the SAS and IDT
schemes across the workloads.

Figure 6: Replaying 6 hours of the MSR sub-workloads. First column is server, second data, and third srccntl.

(b) Varying Hot Extent Set


Figure 7: Extent distribution and CDF for the adver- 
sarial workload. 


6.5 Adversarial Workloads 


Finally, we measure the impact of using EDT with work- 
loads completely different than the one it is provisioned 
for. We used the configuration obtained for the srccntl
workload (in Table 5), and instead of the trace from that
workload, we ran two separate synthetic workloads for
two hours each: (1) a uniformly random workload at
400 IOPS, where each I/O is issued to a random page
in the system, and (2) a workload at 500 IOPS, where I/Os
are issued to a chosen set of 10 hot extents initially in the 
SATA tier and this set changes every minute. 
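
The two synthetic workloads can be approximated with a small generator, sketched below. This is not the authors' load generator: it assumes evenly spaced arrivals, represents pages and extents as plain integers, and draws each hot set from a fixed list of candidate extents rather than tracking which extents currently reside in the SATA tier. The function and parameter names are hypothetical.

```python
import random


def uniform_random_workload(num_pages, iops=400, duration_s=7200, seed=0):
    """Yield (time_s, page) pairs; each I/O targets a uniformly random page."""
    rng = random.Random(seed)
    for i in range(iops * duration_s):
        yield i / iops, rng.randrange(num_pages)


def shifting_hot_set_workload(sata_extents, iops=500, duration_s=7200,
                              hot_set_size=10, period_s=60, seed=0):
    """Yield (time_s, extent) pairs confined to a hot set redrawn every period.

    sata_extents is a list of candidate extent ids (assumed to start in SATA).
    """
    rng = random.Random(seed)
    hot = []
    for i in range(iops * duration_s):
        if i % (iops * period_s) == 0:
            # Start of a new period: pick a fresh set of hot extents.
            hot = rng.sample(sata_extents, hot_set_size)
        yield i / iops, rng.choice(hot)


if __name__ == "__main__":
    # Short runs for illustration; the experiment ran each workload for two hours.
    for t, page in uniform_random_workload(num_pages=1_000_000, duration_s=1):
        pass
    extents = list(range(5000))
    touched = {e for _, e in shifting_hot_set_workload(extents, duration_s=60)}
    print("distinct extents touched in the first minute:", len(touched))
```

Seeding the generators keeps a replay repeatable, which makes it easier to compare tiering policies against the same I/O sequence.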

Figure 7 depicts the distribution of response times for 
both workloads. The uniformly random workload yields 
a 31% higher average response time for EDT and IDT 
compared to SAS. This can be attributed to the constant 
migration I/O moving extents away from the throttled 
SATA tier to both SAS and SSD tiers. Interestingly, we 
see only a 21% penalty for EDT in the second workload. 
Analysis shows that throttling of the newly active extents 
was promptly detected and the extents were migrated 
quickly to the SSD before they became cold. As illus- 
trated by these examples, EDT can handle unexpected 
workloads using its throttling detection/correction tech- 
niques without major performance penalties. 


7 Related Work 


We build on a rich body of related work in multiple areas. 
SSD-based storage architectures. Several products
(IBM’s EasyTier [29], EMC’s FAST [17], 3PAR [25],
and Compellent [23] systems) incorporate SSDs in stor-
age tiering solutions. Since technical details of these
approaches are not published, EDT is the first to pro-
vide insight into design choices and components, de-
tailed evaluation across workloads, and analysis of bene-
fits and challenges in building SSD-based multi-tier sys-
tems. Moreover, the publicly available documents of
these products indicate that although they achieve cost
savings and performance improvements, there is little
focus on tools that help administrators and customers
configure the right device mix for their workload or on
incorporating algorithms that target dynamic energy
savings. EDT addresses these limitations.

Another approach to leverage solid state technology 
in storage systems is to deploy flash devices as a cache 
between DRAM and HDD. NetApp’s FlashCache [24],
which follows this approach, cites cost reduction and per-
formance improvement when coupled with SAS/SATA
drives. Interestingly, Narayanan et al. [22] have argued
that an SSD cache layer above SAS disks was generally
not cost effective compared to an all SAS configuration 
at the same performance. We did find cost savings using 
SSD, but our system included much lower cost SATA 
disks to improve overall cost. Unfortunately, a detailed 
comparison between SSD caching and tiering would take 
a significant effort and more space than is available in 
this paper. However, our summary thoughts on the two 
architectures are: 1) SSD caching will utilize the SSD 
space more efficiently and can be more responsive to 
very dynamically changing workloads, but 2) SSD tier- 
ing enables both cost and energy savings even in enter- 
prise environments. 

Storage configuration (also referred to as provision- 
ing). Systems such as Minerva [3], Hippodrome [5], and 
DAD [6] address the problem of optimizing storage con- 
figuration by iteratively applying several steps such as 
configuring a low cost storage system, choosing RAID 
levels and other array parameters, and assigning entire 
volumes to arrays. EDT-CA’s focus on obtaining the 
right mix of storage devices to minimize cost is similar to 
the configuration step in these systems. The key differ- 
ence is that EDT-CA is inherently aware of, and utilizes 
the flexibility afforded by EDT’s dynamic extent place- 
ment. EDT’s data layout also operates at a much finer 
extent granularity. In EDT, we use a model to predict the 
utilization of an extent (given its bandwidth and random 
IOPS) that is similar in spirit to the previously proposed 
store level performance predictor [30] in its accounting 
for the differential load induced by sequential and ran- 
dom accesses to an extent. Finally, EDT-CA can be en- 
hanced to perform utility based provisioning as in [28]. 
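
As an illustration of a utilization model that accounts separately for random and sequential load, the sketch below charges an extent the fraction of a tier's random-IOPS capability plus the fraction of its bandwidth capability that it consumes. The additive form is an assumption made for this example, not a formula quoted from this section, and the tier capability numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierCapability:
    """Peak capabilities of one array in a tier (illustrative values only)."""
    max_random_iops: float
    max_bandwidth_mbps: float


def extent_utilization(random_iops, bandwidth_mbps, tier):
    """Fraction of a tier's resources consumed by one extent (assumed additive)."""
    return (random_iops / tier.max_random_iops
            + bandwidth_mbps / tier.max_bandwidth_mbps)


def fits(extents, tier):
    """True if the combined utilization of (random_iops, bandwidth) pairs is at most 1."""
    return sum(extent_utilization(r, b, tier) for r, b in extents) <= 1.0


if __name__ == "__main__":
    # Hypothetical capabilities; real values depend on the devices and RAID layout.
    sas = TierCapability(max_random_iops=300, max_bandwidth_mbps=100)
    sata = TierCapability(max_random_iops=100, max_bandwidth_mbps=60)
    # A mostly sequential extent: 2 random IOPS and 20 MB/s.
    print("SAS utilization: %.3f" % extent_utilization(2, 20, sas))
    print("SATA utilization: %.3f" % extent_utilization(2, 20, sata))
```

A model of this shape lets a placement algorithm compare the load an extent would impose on each tier before deciding where it belongs.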

Tiering. Migration-based storage tiering has been preva- 
lent in the industry for a long time in the form of Hierar- 
chical Storage Management systems, Information Life- 
cycle Management solutions, and other forms of coarse- 
grain tiering [2, 13, 15]. Most of these systems differ 
from EDT in that they generally migrate data from upper
to lower tiers, based on its age rather than on load. Fur-
ther, these systems operate on volume, file system, or file 
objects rather than extents, and as such are suited more 
for file layer systems than block layer systems. Wilkes et 
al. propose AutoRAID [33], a storage system where ex- 
tents within volumes are migrated between faster RAID- 
1 arrays and slower RAID-5 arrays according to work- 
load and age. Significantly different algorithms for mi- 
gration decisions tuned to the specific two tiers are pro- 
posed. Additionally, AutoRAID does not consider the 
issue of correctly determining a device mixture to satisfy 
given workloads. 

Storage energy efficiency. EDT uses a consolida- 
tion algorithm to save energy in primary storage sys- 
tems. Other energy saving approaches that instead spin 
down a fraction of the available disk drives with active 
data [8, 10, 18, 20, 21, 26, 27, 31, 32, 34] either are 
not applicable in many primary storage systems due to 
the significant spin up latency, or require undesirable ca-
pacity over-provisioning for redundant data. Work lever-
aging Dynamic RPM capability (e.g., [12, 26, 37, 38])
is complementary to EDT. In fact, Hibernator [38] also
leverages tiering but varies RPM setting of the drives to 
minimize energy. 


8 Discussion 


Extending the resource consumption model. In this
work we assumed RAID-0 arrays when estimating how 
much resource on a tier is consumed by a given work- 
load. In commercial applications of EDT, more sophisti- 
cated models will be needed to estimate resource con- 
sumption in arrays with different RAID levels. Such 
models do already exist in the industry, so we believe in- 
corporating this capability will be straightforward. Also, 
for the scope of this work, we assume that all arrays are 
at the same reliability level, and hence migrating data 
across arrays is not restricted. However, it is feasible to 
remove this constraint by observing policies to limit the 
migration targets of extents. Finally, the resource model 
may need to be enhanced to better model the behavior of
disks servicing multiple sequential IO streams in paral- 
lel. The current model does not account for degradation 
in sequential performance that may occur when a disk 
needs to service multiple sequential streams at once. 

Disk power fraction in the overall energy of a stor- 
age system. The chief dynamic energy-saving tech- 
nique proposed in this work is powering down empty 
disk drives. However, we find that in today’s commercial 
storage systems, disk drives typically consume ~50% of 
the total storage system energy [14] while the rest is con- 
sumed by other components which do not currently have 
the capability of varying their energy consumption ac- 
cording to workload. As these components overcome 
this limitation, our energy-saving techniques can be ex-
tended to include them, leading to a more energy propor-
tional system and lower overall operating costs.
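
A rough calculation shows how the disk power fraction bounds system-level savings. The sketch assumes that non-disk components draw constant power and that the 15-31% dynamic savings reported for the sub-workloads apply to the disks' share of energy; both are simplifications, and the function name is hypothetical.

```python
def system_level_savings(disk_fraction, disk_savings):
    """Overall savings when only the disk share of energy can be reduced."""
    return disk_fraction * disk_savings


if __name__ == "__main__":
    disk_fraction = 0.5  # disks consume about 50% of storage system energy [14]
    for disk_savings in (0.15, 0.31):
        total = system_level_savings(disk_fraction, disk_savings)
        print("disk savings %.0f%% -> system savings %.1f%%"
              % (100 * disk_savings, 100 * total))
```

Under these assumptions, a 15-31% reduction in disk power translates to roughly 7.5-15.5% at the system level, which is why making the remaining components energy proportional matters.
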
Applicability. The target domain for EDT is primary 
storage systems where response time is critical. Archival 
applications where response time is not as critical may be 
better served with existing solutions using policy-based 
migration and power-saving storage such as spun-down 
disk or tape. Also, EDT will be most effective when the 
working set and I/O intensity are somewhat stable with 
some variation. When the workload is static, dynamic 
migration will not take place but consolidation will still 
be beneficial if the system is not capacity bound. 


9 Conclusion 


The increasing availability of solid-state drives has ush- 
ered in a new era of multi-tiered primary storage sys- 
tems. With EDT, we have formalized the configuration 
and dynamic tier management problems and have sys- 
tematically explored the design choices available when 
building such systems. We presented the design, im- 
plementation, and evaluation of EDT’s Configuration 
Adviser (EDT-CA) and Dynamic Tier Manager (EDT- 
DTM). EDT lowers capital cost by configuring less ex- 
pensive tiered storage and operating costs by dynami- 
cally optimizing power consumption via consolidation 
whenever feasible. We also demonstrated that EDT is 
successfully able to address the data migration overheads 
of dynamic tiering and respond rapidly and effectively to 
unexpected changes in the workload. 

Experimental results show EDT has significant bene- 
fit. Evaluation performed using both a production work- 
load and industry-standard synthetic workload revealed 
that multi-tier systems using EDT have a device mix that 
saves between 5% and 45% in cost, consume up to 54%
less peak power, and an additional 15-30% lower dy- 
namic power (instantaneous power averaged over time), 
at a better or comparable performance compared to a ho- 
mogeneous SAS storage system. Experimental results 
also demonstrated that EDT is superior to simpler al- 
ternatives for extent-based tiering, providing lower cost 
and better performance, and consuming similar or less
power. We hope that this study serves as a starting point 
for future work along the promising direction of multi- 
tiered enterprise storage systems. 